cbimport json

Imports JSON data into Couchbase

SYNOPSIS

cbimport json [--cluster <url>] [--bucket <bucket_name>] [--dataset <uri>]
              [--format <data_format>] [--username <username>] [--password <password>]
              [--generate-key <key_expr>] [--cacert <path>] [--no-ssl-verify]
              [--limit-docs <num>] [--skip-docs <num>]
              [--threads <num>] [--errors-log <path>] [--log-file <path>] [--verbose]

DESCRIPTION

Imports JSON data into Couchbase. The cbimport command supports files that have a JSON document on each line, files that contain a JSON list where each element is a document, and the Couchbase Samples format. The file format is specified with the --format flag. See the DATASET FORMATS section below for more details on the supported file formats.

The cbimport command also supports custom key-generation for each document in the imported file. Key generation is done with a combination of pre-existing fields in a document and custom generator functions supplied by cbimport. See the KEY GENERATION section below for details about key generators.

OPTIONS

Below is a list of required and optional parameters for the cbimport command.

Required

-c,--cluster <url>

The hostname of a node in the cluster to import data into. See the HOST FORMATS section below for details about hostname specification formats.

-u,--username <username>

The username for cluster authentication. The user must have the appropriate privileges to write to the bucket into which data will be loaded.

-p,--password <password>

The password for cluster authentication. Specifying this option without a value will prompt for the password, which is then read from stdin without being echoed.

-b,--bucket <bucket_name>

The name of the bucket to import data into.

-d,--dataset <uri>

The URI of the dataset to be loaded. cbimport supports loading data from local files only. When importing data from a local file the path must be prefixed with file://.

-f,--format <format>

The format of the dataset specified (lines, list, sample). See the DATASET FORMATS section below for more details on the formats supported by cbimport.

Optional

-g,--generate-key <key_expr>

Specifies a key expression used for generating a key for each document imported. This parameter is required for list and lines formats, but not for the sample format. See the KEY GENERATION section below for more information on specifying key generators.

--no-ssl-verify

Skips the SSL verification phase. Specifying this flag will allow a connection using SSL encryption, but will not verify the identity of the server you connect to. You are vulnerable to a man-in-the-middle attack if you use this flag. Either this flag or the --cacert flag must be specified when using an SSL encrypted connection.

--cacert <cert_path>

Specifies a CA certificate that will be used to verify the identity of the server being connected to. Either this flag or the --no-ssl-verify flag must be specified when using an SSL encrypted connection.

--limit-docs <num>

Specifies that the utility should stop loading data after reading the given number of documents from the dataset. This option is useful when you have a large dataset and only want to load part of it.

--skip-docs <num>

Specifies that the utility should skip the given number of documents before it starts importing data. If this flag is used together with the --limit-docs flag, the utility imports the number of documents specified by --limit-docs after it has skipped the documents specified by --skip-docs.

-t,--threads <num>

Specifies the number of concurrent clients to use when importing data. Fewer clients means imports will take longer, but fewer cluster resources will be used to complete the import. More clients means faster imports, but at the cost of more cluster resource usage. This parameter defaults to 1 if it is not specified. It is recommended that this parameter not be set higher than the number of CPUs on the machine where the import is taking place.

-e,--errors-log <path>

Specifies a log file to which JSON documents that could not be loaded are written. A document might not be loaded if a key could not be generated for the document or if the document is not valid JSON. The errors file is written in the "lines" format (one document per line).

-l,--log-file <path>

Specifies a log file for writing debugging information about cbimport execution.

-v,--verbose

Specifies that logging should be sent to stdout. If this flag is specified along with the -l/--log-file option then the verbose option is ignored.
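
The way --skip-docs and --limit-docs combine can be pictured with a short Python sketch. It only illustrates the documented order (skip first, then limit) and is not how cbimport is implemented; the function name select_docs and the sample documents are hypothetical.

```python
from itertools import islice


def select_docs(docs, skip_docs=0, limit_docs=None):
    """Skip the first `skip_docs` documents, then yield at most `limit_docs`."""
    stop = None if limit_docs is None else skip_docs + limit_docs
    return islice(docs, skip_docs, stop)


docs = [{"id": n} for n in range(1, 11)]  # ten made-up documents
selected = list(select_docs(docs, skip_docs=3, limit_docs=4))
print([d["id"] for d in selected])  # prints [4, 5, 6, 7]
```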

HOST FORMATS

When specifying a host for the cbimport command the following formats are expected:

  • couchbase://<addr>

  • <addr>:<port>

  • http://<addr>:<port>

It is recommended to use the couchbase://<addr> format for standard installations. The other two formats take a port number, which is needed for non-default installations where the admin port has been set up on a port other than 8091.

DATASET FORMATS

The cbimport command supports the formats listed below.

LINES

The lines format specifies a file that contains one JSON document on every line in the file. This format is specified by setting the --format option to "lines". Below is an example of a file in lines format.

{"key": "mykey1", "value": "myvalue1"}
{"key": "mykey2", "value": "myvalue2"}
{"key": "mykey3", "value": "myvalue3"}
{"key": "mykey4", "value": "myvalue4"}

LIST

The list format specifies a file which contains a JSON list where each element in the list is a JSON document. The file may only contain a single list, but the list may be specified over multiple lines. This format is specified by setting the --format option to "list". Below is an example of a file in list format.

[
  {
    "key": "mykey1",
    "value": "myvalue1"
  },
  {"key": "mykey2", "value": "myvalue2"},
  {"key": "mykey3", "value": "myvalue3"},
  {"key": "mykey4", "value": "myvalue4"}
]
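
The two formats carry the same information, so a list file can be rewritten as a lines file. The following Python sketch shows one way to do that under the definitions above; list_to_lines is a hypothetical helper and not a cbimport feature.

```python
import json


def list_to_lines(list_text):
    """Convert a JSON list of documents into lines format (one per line)."""
    documents = json.loads(list_text)
    if not isinstance(documents, list):
        raise ValueError("list format requires a single top-level JSON list")
    # json.dumps never emits raw newlines, so each document stays on one line.
    return "".join(json.dumps(doc) + "\n" for doc in documents)


list_text = """[
  {
    "key": "mykey1",
    "value": "myvalue1"
  },
  {"key": "mykey2", "value": "myvalue2"}
]"""
print(list_to_lines(list_text), end="")
```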

SAMPLE

The sample format specifies a ZIP file or folder containing multiple documents. This format is intended to load Couchbase sample data sets. Unlike the lines and list formats, the sample format may also contain index, view, and full-text index definitions. The folder structure is specified below.

+ (root folder)
  + docs
    key1.json
    key2.json
    ...
  + design_docs
    indexes.json
    views.json

All documents in the sample format are contained in the docs folder and there is one file per document. Each filename in the docs folder is the key name for the JSON document contained in the file. If the filename contains a .json extension then the extension is excluded from the key name during the import. This name can be overridden if the --generate-key option is specified. The docs folder may also contain sub-folders of documents to be imported. Sub-folders can be used to organize large numbers of documents into a more readable, categorized form.

The design_docs folder contains index definitions. The filename indexes.json is reserved for secondary indexes. All other file names are used for view indexes.
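
The default key naming rule for the docs folder can be sketched in Python as follows. This is an illustration of the rule stated above, not cbimport's code; key_for_sample_file is a hypothetical name, and using only the final path component for files in sub-folders is an assumption of the sketch.

```python
from pathlib import PurePosixPath


def key_for_sample_file(filename):
    """Derive the default document key from a file name in the docs folder."""
    name = PurePosixPath(filename).name  # assumption: folder names are dropped
    # Only a trailing .json extension is removed from the key.
    return name[:-len(".json")] if name.endswith(".json") else name


for filename in ["docs/key1.json", "docs/key2.json", "docs/notes"]:
    print(filename, "->", key_for_sample_file(filename))
```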

KEY GENERATION

Key generators are used in order to generate a key for each document loaded. Keys can be generated by using a combination of characters, the values of a given field in a document, and custom generators. Field substitutions are done by wrapping the field name in "%" and custom generators are wrapped in "#". Below is an example of a key generation expression.

Given the document:

{
  "name": "alice",
  "age": 40
}

Key Generator Expression:

--generate-key key::%name%::#MONO_INCR#

The following key would be generated:

key::alice::1

In the example above we generate a key using both the value of a field in the document and a custom generator. The "%name%" token tells the key generator to substitute the value of the document's "name" field into the key.

This example also contains a generator function, MONO_INCR, which increments by 1 each time the key generator is called. Since this is the first time this key generator was executed it returns 1. If we executed the key generator again it would return 2, and so on. The starting value of the MONO_INCR generator is 1 by default, but it can be changed by specifying a number in brackets after the MONO_INCR generator name. For example, to generate monotonically incrementing values starting at 100, the generator MONO_INCR[100] would be specified. The cbimport command currently contains a monotonic increment generator (MONO_INCR) and a UUID generator (UUID).

Any text that isn’t wrapped in "%" or "#" is static text and will be in the result of all generated keys. If a key needs to contain a "%" or "#" in static text then the character needs to be escaped by doubling it (ex. "%%" or "##").

If a key cannot be generated because the field specified in the key generator is not present in the document then the document will be skipped. To see which documents were not imported due to failed key generation, users can specify the --errors-log <path> parameter to write all documents that could not be imported to a file.
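
The rules in this section can be summarized with a small Python model of a key expression evaluator. It is a sketch for illustration, not cbimport's implementation: make_key_generator is a hypothetical name, only top-level fields are looked up, field values are converted with str(), and keeping one counter per MONO_INCR occurrence is an assumption.

```python
import itertools
import re
import uuid

# One token is an escaped delimiter, a %field% substitution, or a #GENERATOR#.
TOKEN = re.compile(r"%%|##|%([^%]+)%|#([A-Z_]+)(?:\[(\d+)\])?#")


def make_key_generator(expression):
    """Return a function mapping a document to a key, or None on failure."""
    counters = {}  # assumption: one counter per MONO_INCR occurrence

    def generate(document):
        def replace(match):
            text = match.group(0)
            if text in ("%%", "##"):
                return text[0]  # escaped delimiter becomes a literal character
            field, generator, start = match.groups()
            if field is not None:
                return str(document[field])  # KeyError if the field is missing
            if generator == "MONO_INCR":
                counter = counters.setdefault(
                    match.start(), itertools.count(int(start or 1)))
                return str(next(counter))
            if generator == "UUID":
                return str(uuid.uuid4())
            raise ValueError(f"unknown generator: {generator}")

        try:
            return TOKEN.sub(replace, expression)
        except KeyError:
            return None  # the caller would skip this document

    return generate


generate = make_key_generator("key::%name%::#MONO_INCR#")
print(generate({"name": "alice", "age": 40}))  # key::alice::1
print(generate({"name": "bob", "age": 39}))    # key::bob::2
print(generate({"age": 25}))                   # None: no "name" field
```

Static text passes through unchanged because only the matched tokens are replaced; everything else in the expression is copied as-is.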

EXAMPLES

The examples below import data from the following files.

/data/lines.json

{"name": "alice", "age": 37}
{"name": "bob", "age": 39}

/data/list.json

[
  {"name": "candice", "age": 42},
  {"name": "daniel", "age": 38}
]

To import data from /data/lines.json using a key containing the "name" field and utilizing 4 threads the following command can be run.

$ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
 -b default -d file:///data/lines.json -f lines -g key::%name% -t 4
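
When an import is scripted, the same argument list can be assembled programmatically. The Python sketch below rebuilds the command above from the options documented in this page; build_import_command is a hypothetical helper, and the sketch only prints the command rather than running cbimport.

```python
import shlex


def build_import_command(cluster, username, password, bucket, dataset,
                         data_format, key_expr, threads=1):
    """Assemble a cbimport json argument list from the documented options."""
    return [
        "cbimport", "json",
        "-c", cluster, "-u", username, "-p", password,
        "-b", bucket, "-d", dataset, "-f", data_format,
        "-g", key_expr, "-t", str(threads),
    ]


command = build_import_command(
    "couchbase://127.0.0.1", "Administrator", "password",
    "default", "file:///data/lines.json", "lines", "key::%name%", threads=4)
# shlex.join quotes any argument that the shell would otherwise interpret.
print(shlex.join(command))
```

Passing the list form to a process launcher avoids shell quoting issues with characters such as "#" in key expressions.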

To import data from /data/list.json using a key containing the "name" field and the UUID generator the following command would be run.

$ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
 -b default -d file:///data/list.json -f list -g key::%name%::#UUID# -t 4

If the dataset is not available on the local machine where the command is run, but is available via an HTTP URL, we can still import the data using cbimport. If we assume that the data is located at http://data.org/list.json and that the dataset is in the JSON list format, then we can import the data with the command below.

$ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
 -b default -d http://data.org/list.json -f list -g key::%name%::#UUID# -t 4

DISCUSSION

The cbimport-json command is used to quickly import data from various files containing JSON data. While importing JSON, the cbimport command uses only a single reader. As a result, a large dataset may import faster if it is partitioned into multiple files and a separate cbimport process is run on each file.
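
One way to partition a dataset in lines format is to distribute its lines round-robin across several files. The Python sketch below shows only the distribution step; partition_lines, the sample documents, and the output file naming in the comment are hypothetical.

```python
def partition_lines(lines, parts):
    """Distribute documents round-robin into `parts` lists of lines."""
    buckets = [[] for _ in range(parts)]
    for index, line in enumerate(lines):
        buckets[index % parts].append(line)
    return buckets


lines = ['{"name": "alice", "age": 37}',
         '{"name": "bob", "age": 39}',
         '{"name": "carol", "age": 41}']  # made-up sample documents
for number, bucket in enumerate(partition_lines(lines, 2)):
    # Each bucket could be written to its own file, for example
    # /data/lines-0.json, and passed to a separate cbimport process.
    print(number, len(bucket))
```

Each process would presumably start its own MONO_INCR counter, so keys built only from that generator could collide across partitions; a key expression based on document fields or UUID avoids that concern.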

ENVIRONMENT AND CONFIGURATION VARIABLES

(None)

SEE ALSO

CBIMPORT

Part of the cbimport suite