
cbimport csv


    Imports CSV data into Couchbase

    SYNOPSIS

    cbimport csv [--cluster <url>] [--bucket <bucket_name>] [--dataset <path>]
                 [--username <username>] [--password <password>] [--generate-key <key_expr>]
                 [--limit-rows <num>] [--skip-rows <num>] [--field-separator <char>]
                 [--cacert <path>] [--no-ssl-verify] [--threads <num>] [--errors-log <path>]
                 [--log-file <path>] [--verbose] [--field-delimiter <char>]
                 [--generator-delimiter <char>] [--ignore-fields <fields>]
                 [--infer-types] [--omit-empty]
                 [--scope-collection-exp <scope_collection_expression>]

    DESCRIPTION

    Imports CSV and other forms of separated value type data into Couchbase. By default data files should contain comma separated values, but if for example you are importing data that is tab separated you can use the --field-separator flag to specify that tabs are used instead of commas.

    The cbimport command also supports custom key-generation for each document in the imported file. Key generation is done with a combination of pre-existing fields in a document and custom generator functions supplied by cbimport. See the KEY GENERATION section below for details about key generators.

    OPTIONS

    Below is a list of the required and optional parameters for the cbimport csv command.

    Required

    -c,--cluster <url>

    The hostname of a node in the cluster to import data into. See the HOST FORMATS section below for details about hostname specification formats.

    -u,--username <username>

    The username for cluster authentication. The user must have the appropriate privileges to write to the bucket to which the data will be loaded.

    -p,--password <password>

    The password for cluster authentication. The user must have the appropriate privileges to write to the bucket to which the data will be loaded. Specifying this option without a value allows the user to type a non-echoed password on stdin.

    -b,--bucket <bucket_name>

    The name of the bucket to import data into.

    -d,--dataset <uri>

    The URI of the dataset to be loaded. cbimport supports loading data from a local file or from a URL. When importing data from a local file the path must be prefixed with file://.

    Optional

    -g,--generate-key <key_expr>

    Specifies a key expression used to generate a key for each imported document. See the KEY GENERATORS section below for more information on specifying key generators. If the resulting key is not unique, documents with the same key will overwrite one another, resulting in fewer documents than expected being imported. To ensure that each key is unique, add #MONO_INCR# or #UUID# to the key generator expression.

    --field-delimiter <char>

    Specifies the character used to denote field references in the key generator expression. It defaults to %. See the KEY GENERATORS section.

    --generator-delimiter <char>

    Specifies the character used to denote generator references in the key generator expression. It defaults to #. See the KEY GENERATORS section.

    --field-separator <char>

    Specifies the field separator to use when reading the dataset. By default the separator is a comma. To read tab separated files you can specify a tab in this field. Tabs are specified as \t.
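    Most shells will not pass a raw tab typed on the command line, so the tab is usually written with the shell's own quoting. The snippet below is an illustrative sketch (plain shell, not cbimport): it shows that Bash's ANSI-C quoting $'\t' expands to a single tab byte, which is the form typically handed to --field-separator.

```shell
# $'\t' is Bash ANSI-C quoting for a literal tab character; this is one
# way to pass a tab to a flag such as --field-separator.
tab=$'\t'
# od renders the byte so we can confirm it really is a tab (\t).
printf '%s' "$tab" | od -An -c
```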

    --limit-rows <num>

    Specifies that the utility should stop loading data after reading the given number of rows from the dataset. This option is useful when you have a large dataset and only want to load part of it.

    --skip-rows <num>

    Specifies the number of rows to skip before importing begins. If this flag is used together with --limit-rows, the utility imports at most the number of rows specified by --limit-rows after skipping the rows specified by --skip-rows.
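    As an illustration only (plain shell tools standing in for cbimport), the combined effect of --skip-rows 2 --limit-rows 3 on a ten-row dataset can be pictured with tail and head: skip the first two rows, then take at most three.

```shell
# Build a ten-row sample dataset.
printf 'row%s\n' 1 2 3 4 5 6 7 8 9 10 > /tmp/rows.csv

# tail -n +3 skips the first 2 rows (like --skip-rows 2);
# head -n 3 then takes at most 3 rows (like --limit-rows 3).
tail -n +3 /tmp/rows.csv | head -n 3   # prints row3, row4, row5
```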

    --no-ssl-verify

    Skips the SSL verification phase. Specifying this flag will allow a connection using SSL encryption, but will not verify the identity of the server you connect to. You are vulnerable to a man-in-the-middle attack if you use this flag. Either this flag or the --cacert flag must be specified when using an SSL encrypted connection.

    --infer-types

    By default, all values in a CSV file are interpreted as strings. If this flag is set, cbimport examines each value, decides whether it is a string, integer, or boolean value, and stores the inferred type in the document.

    --omit-empty

    Some values in a CSV row may not contain any data. By default, these values are put into the generated JSON document as empty strings. Specify this flag to omit fields whose values are empty.

    --cacert <cert_path>

    Specifies a CA certificate used to verify the identity of the server being connected to. Either this flag or the --no-ssl-verify flag must be specified when using an SSL encrypted connection.

    --scope-collection-exp <scope_collection_expression>

    When importing to a collection-aware cluster, you may optionally provide a scope/collection expression that determines the scope/collection into which each imported document is inserted. This flag closely resembles the behavior/syntax of the --generate-key flag. For example, to use a static scope/collection, use --scope-collection-exp scope.collection. To use information from the CSV row, specify the column name between % characters; for example, --scope-collection-exp %scope_column%.%collection_column%. A % character in a column name may be escaped using %%. For more information about the accepted format, see the SCOPE/COLLECTION PARSER section.

    -t,--threads <num>

    Specifies the number of concurrent clients to use when importing data. Fewer clients means the import takes longer but uses fewer cluster resources; more clients means a faster import at the cost of higher cluster resource usage. This parameter defaults to 1. It is recommended not to set it higher than the number of CPUs on the machine where the import is taking place.

    -e,--errors-log <path>

    Specifies a log file to which documents that could not be loaded are written. A document might not be loaded if a key could not be generated for it or if the generated document is not valid JSON. The errors file is written in the "json lines" format (one document per line).

    -l,--log-file <path>

    Specifies a log file for writing debugging information about cbimport execution.

    -v,--verbose

    Specifies that logging should be sent to stdout. If this flag is specified along with the -l/--log-file option then the verbose option is ignored.

    --ignore-fields <fields>

    Specifies a comma-separated list of field names to exclude from the imported documents. The field reference syntax is the same as that used in the KEY GENERATORS section to refer to fields.


    KEY GENERATORS

    Key generators are used to generate a key for each document loaded. Keys can be built from a combination of static text, values from the row being imported, and custom generators. Field substitutions are done by wrapping the column name in "%", and custom generators are wrapped in "#". Below is an example of a key generation expression.

    Given the CSV dataset:

    fname,age
    alice,40
    barry,36

    Key Generator Expression:

    --generate-key key::%fname%::#MONO_INCR#

    The following keys would be generated:

    key::alice::1
    key::barry::2

    In the example above we generate a key using the value in the fname column of each row and a custom generator. To substitute the value of the fname column, we put the column name between two percent signs. This is an example of field substitution, and it makes it possible to build keys out of data that is already in the dataset.

    This example also contains the generator function MONO_INCR, which increments by 1 each time the key generator is called. Since this is the first time this key generator was executed, it returns 1. If we executed the key generator again it would return 2, and so on. The starting value of the MONO_INCR generator is 1 by default, but it can be changed by specifying a number in square brackets after the generator name. For example, to start generating monotonically incrementing values at 100, specify MONO_INCR[100]. The cbimport command currently provides a monotonic increment generator (MONO_INCR) and a UUID generator (UUID).
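    As a rough sketch of what cbimport does internally (illustrative only; awk stands in for cbimport's key generator), the expression key::%fname%::#MONO_INCR# applied to the sample dataset above can be emulated like this:

```shell
# Recreate the sample dataset from above.
printf 'fname,age\nalice,40\nbarry,36\n' > /tmp/people.csv

# Substitute column 1 for %fname% and a 1-based counter for #MONO_INCR#.
awk -F, 'NR > 1 { printf "key::%s::%d\n", $1, NR - 1 }' /tmp/people.csv
# prints:
#   key::alice::1
#   key::barry::2
```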

    Any text that isn’t wrapped in "%" or "#" is static text and will be in the result of all generated keys. If a key needs to contain a "%" or "#" in static text then they need to be escaped by providing a double "%" or "#" (ex. "%%" or "##"). The delimiter characters can be changed to avoid having to escape them by using the --field-delimiter and --generator-delimiter flags.

    If a key cannot be generated because a field referenced in the key generator is not present in the document, that document is skipped. To see a list of documents that were not imported due to failed key generation, specify the --errors-log <path> parameter to write all documents that could not be imported to a file.

    SCOPE/COLLECTION PARSER

    Scope/collection parsers are used to determine which scope/collection documents are inserted into. There are currently two supported parsers: text and field.

    Given the CSV dataset:

    product,stock,type,subtype
    apple,100,produce,fruit

    Scope/collection expression:

    --scope-collection-exp %type%.%subtype%

    The row would be inserted into the 'fruit' collection inside the 'produce' scope.
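    Again as an illustration only (awk standing in for cbimport's parser), the substitution performed by %type%.%subtype% on that row looks like:

```shell
# Recreate the sample dataset from above.
printf 'product,stock,type,subtype\napple,100,produce,fruit\n' > /tmp/products.csv

# Columns 3 and 4 are substituted for %type% and %subtype%.
awk -F, 'NR > 1 { printf "%s.%s\n", $3, $4 }' /tmp/products.csv
# prints: produce.fruit
```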

    Given the CSV dataset:

    fname,age
    alice,40
    barry,36

    Scope/collection expression:

    --scope-collection-exp uk.manchester

    In this case, no fields in the row will be used to determine the scope or collection; all the rows would be inserted into the 'manchester' collection inside the 'uk' scope.

    Text and field expressions can be mixed; the following are all valid expressions.

    uk.%city%
    uk.%city%-5
    %country%.%city%::%town%

    EXAMPLES

    The examples below import data from the following files.

    ./data/people.csv
    
      fname,age
      alice,40
      barry,36
    
    ./data/people.tsv
    
      fname  age
      alice  40
      barry  36

    To import data from /data/people.csv, generating a key containing the fname column and using 4 threads, the following command can be run.

    $ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
     -b default -d file:///data/people.csv -g key::%fname% -t 4

    To import data from /data/people.tsv using a key containing the fname column and the UUID generator the following command would be run.

    $ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
     -b default -d file:///data/people.tsv --field-separator $'\t' \
     -g key::%fname%::#UUID# -t 4

    To import data from /data/list.csv using a key containing the "name" column followed by a unique id separated by a "#", we can use the --generator-delimiter flag to avoid escaping the "#" sign. For example:

    $ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
     -b default -d file:///data/list.csv --generator-delimiter '£' \
     -g key::%name%#£UUID£ -t 4

    If the dataset is not available on the local machine where the command is run, but is available via an HTTP URL, we can still import the data using cbimport. If we assume the data is located at http://data.org/people.csv, then we can import it with the command below.

    $ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
     -b default -d http://data.org/people.csv -g key::%fname%::#UUID# -t 4

    If the CSV dataset contains information that determines the target scope/collection, a command like the one below can be used. Given the dataset:

    product,stock,type,subtype
    apple,100,produce,fruit

    $ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
     -b default -d file:///data/list.csv -g %product% \
     --scope-collection-exp %type%.%subtype%

    This command would place the row into the fruit collection inside the produce scope.

    DISCUSSION

    The cbimport csv command is used to quickly import data from files containing CSV, TSV, or other separated-value data. When importing CSV, cbimport uses only a single reader per file. As a result, importing a large dataset may benefit from being partitioned into multiple files, with a separate cbimport process run on each file.

    ENVIRONMENT AND CONFIGURATION VARIABLES

    CB_CLUSTER

    Specifies the hostname of the Couchbase cluster to connect to. If the hostname is supplied as a command line argument then this value is overridden.

    CB_USERNAME

    Specifies the username for authentication to a Couchbase cluster. If the username is supplied as a command line argument then this value is overridden.

    CB_PASSWORD

    Specifies the password for authentication to a Couchbase cluster. If the password is supplied as a command line argument then this value is overridden.
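    For example, credentials can be exported once and then omitted from subsequent cbimport invocations. The cbimport line below is commented out because it requires a running cluster; the values are placeholders.

```shell
export CB_CLUSTER=couchbase://127.0.0.1
export CB_USERNAME=Administrator
export CB_PASSWORD=password

# With the variables above set, the -c/-u/-p flags can be omitted:
# cbimport csv -b default -d file:///data/people.csv -g key::%fname%
echo "cluster=$CB_CLUSTER user=$CB_USERNAME"
```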

    SEE ALSO

    CBIMPORT

    Part of the cbimport suite