Hadoop connector 1.2

The Couchbase Hadoop Connector allows you to connect to Couchbase Server 2.5, 3.x, or 4.x and stream keys into the Hadoop Distributed File System (HDFS) or Hive for processing with Hadoop.

If you have used Apache Sqoop before with other databases, using this connector should be straightforward because it uses a similar command line argument structure. Some arguments might seem slightly different because Couchbase has a very different structure than a typical RDBMS.

Getting started

Download the Couchbase Hadoop Connector version 1.2.0 from http://packages.couchbase.com/clients/connectors/couchbase-hadoop-plugin-1.2.0.zip.

The Couchbase Hadoop Connector is supported on Cloudera 5. Cloudera has certified the Couchbase Hadoop Connector 1.2 release for Cloudera 5.

The Couchbase Hadoop Connector is supported on Hortonworks Data Platform (HDP) 2.2. Hortonworks has certified the Couchbase Hadoop Connector 1.2 release for HDP 2.2.

Installation

You can install the Couchbase Hadoop Connector either by running a script or by manually copying the files to specified directories within your Sqoop installation. The distribution package contains a set of files that need to be copied into your Sqoop installation and a script that copies the files for you if you provide the path to the Sqoop installation.

The following table describes the files in the Couchbase Hadoop Connector distribution and lists where each file is installed. In the installation location column, $sqoop_home represents the path to your Sqoop installation.

Table 1. Files in the Couchbase Hadoop Connector package

couchbase-client-1.4.4.bundled.jar
    Description: A library dependency of the connector that handles the basic communications with the Couchbase cluster
    Installation location: $sqoop_home/lib

couchbase-config.xml
    Description: Property file used to register a ManagerFactory for the connector with Sqoop
    Installation location: $sqoop_home/conf

couchbase-hadoop-plugin-1.2.0.jar
    Description: The Couchbase Hadoop Connector itself
    Installation location: $sqoop_home/lib

couchbase-manager.xml
    Description: Property file that tells Sqoop where the ManagerFactory defined in couchbase-config.xml resides
    Installation location: $sqoop_home/conf/managers.d

install.sh
    Description: Couchbase Hadoop Connector installation script
    Installation location: Not applicable

jettison-1.1.jar
    Description: A dependency of the Couchbase client
    Installation location: $sqoop_home/lib

netty-3.5.5.Final.jar
    Description: A dependency of the Couchbase client
    Installation location: $sqoop_home/lib

spymemcached-2.11.4.jar
    Description: A library dependency of the Couchbase client that provides networking and core protocol handling for data transfer
    Installation location: $sqoop_home/lib

Script-based installation

Script-based installation uses the install.sh script included in the connector download. The script takes one argument: the path to your Sqoop installation. The basic command format for invoking the script is:

shell> sh install.sh path_to_sqoop_home

In an HDP deployment, Sqoop is located at /usr/hdp/current/sqoop-client, which you use as the path to the Sqoop installation. For HDP, invoke the installation script as follows:

shell> sh install.sh /usr/hdp/current/sqoop-client

Manual installation

To install the Couchbase Hadoop Connector manually, copy each JAR and XML file listed in Table 1 into the directory specified in the installation location column.
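
For example, assuming you are in the directory where you unpacked the connector distribution and that $sqoop_home points at your Sqoop installation, the copy commands might look like the following sketch:

shell> cp couchbase-hadoop-plugin-1.2.0.jar \
    couchbase-client-1.4.4.bundled.jar \
    spymemcached-2.11.4.jar \
    jettison-1.1.jar \
    netty-3.5.5.Final.jar \
    $sqoop_home/lib
shell> cp couchbase-config.xml $sqoop_home/conf
shell> cp couchbase-manager.xml $sqoop_home/conf/managers.d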

Uninstallation

Uninstalling the connector requires removing all of the files that were added to Sqoop during installation. To remove them, change to (cd into) your Sqoop home directory and execute the following command:

shell> rm lib/couchbase-hadoop-plugin-1.2.0.jar \
    lib/spymemcached-2.11.4.jar \
    lib/jettison-1.1.jar \
    lib/netty-3.5.5.Final.jar \
    lib/couchbase-client-1.4.4.bundled.jar \
    conf/couchbase-config.xml \
    conf/managers.d/couchbase-manager.xml

Using Sqoop

The Couchbase Hadoop Connector can be used with a variety of command line tools provided by Sqoop. In this section we discuss the usage of each tool.

Tables

Because Sqoop is built around a relational model, it requires the user to specify a table when importing from or exporting to Couchbase. The Couchbase Hadoop Connector uses the --table option to specify the type of data stream for imports and exports.

For exports, the user must enter a value for the --table option, although the value entered is not used by the connector.

For imports, the --table option accepts two values; the connector exits with an error if an invalid value is supplied.

  • DUMP—Reads all keys currently in Couchbase into HDFS. Any data items received by the Couchbase cluster while this command is running are also passed along by the connector, so new or changed items become part of the dump. However, items removed while the dump is running are not removed from the output.

  • BACKFILL_##—Streams all key mutations for a given amount of time (in minutes). This is best used to sample a bucket in a cluster for a period of time.

For the BACKFILL table type, replace the ## characters with a time in minutes. For example, --table BACKFILL_5 streams key mutations from the Couchbase server for 5 minutes and then stops the stream.

Connect string

A connect string is required to connect to Couchbase. Specify it with the --connect argument to the sqoop command. The following examples show valid connect strings:

http://10.2.1.55:8091/pools
http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools

When creating your connect strings, replace the IP addresses shown above with the host name or IP address of one or more nodes in your Couchbase cluster. If you have multiple servers, list them in a comma-separated list.
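
For example, assuming the two nodes shown above, a complete import command that uses a multi-node connect string might look like this (the DUMP table type is described under Tables above):

shell> sqoop import \
    --connect http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools \
    --table DUMP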

Connecting to different buckets

By default, the Couchbase Hadoop Connector connects to the default bucket. To connect to a bucket other than the default, specify the bucket name with the --username option. If the bucket has a password, supply it with the --password option.

Note that there are several ways to supply the password to Sqoop. The -P argument prompts for the password on the console, and the --password-file argument reads the password from a file in HDFS.
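
For example, to dump a password-protected bucket named mybucket (the same example bucket used in the Importing section below), you might run:

shell> sqoop import --username mybucket --password mypassword \
    --connect http://10.2.1.55:8091/pools --table DUMP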

Importing

Importing data into your Hadoop cluster requires the sqoop import command followed by the --connect and --table parameters.

The following example command dumps all items from Couchbase into HDFS. Since the Couchbase Java Client has support for a number of different data types, all values are normalized to strings when being written to a Hadoop text file.

shell> sqoop import --connect http://10.2.1.55:8091/pools --table DUMP

The following example command streams all item mutations from Couchbase into HDFS for a period of 10 minutes.

shell> sqoop import --connect http://10.2.1.55:8091/pools --table BACKFILL_10

Sqoop provides many more options to the import command than are covered in this document. Run sqoop help import for a list of all options, and see the Sqoop documentation for more details about them.

You have a number of options for supplying the password when accessing a bucket. The following examples are equivalent for a bucket named mybucket that uses the password mypassword, provided that the file passed to --password-file contains the password without a trailing newline or carriage return.

shell> sqoop import --username mybucket -P --verbose \
    --connect http://10.2.1.55:8091/pools --table DUMP
shell> sqoop import --username mybucket --password mypassword --verbose \
    --connect http://10.2.1.55:8091/pools --table DUMP
shell> sqoop import --username mybucket --password-file passwordfile \
    --verbose --connect http://10.2.1.55:8091/pools --table DUMP

Some options that may be important in your import are those that define the delimiters Sqoop uses when writing records. The default delimiter is the comma (,). Through the sqoop command you can specify a different delimiter if, for instance, the item's key or value is likely to contain a comma.
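
For example, Sqoop's generic --fields-terminated-by option can be used to write tab-separated records instead of comma-separated ones. This is a standard Sqoop formatting option rather than something specific to the Couchbase connector; the command below is a sketch:

shell> sqoop import --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t'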

When the import job executes, it also generates a .java source code file that can facilitate reading and writing the records imported by other Hadoop MapReduce jobs. If, for instance, the job run was a DUMP, Sqoop generates a DUMP.java source code file.

Exporting

Exporting data to your Couchbase cluster requires the sqoop export command followed by the --connect, --export-dir, and --table parameters.

The following example exports all records from the files in the HDFS directory specified by --export-dir into Couchbase.

shell> sqoop export --connect http://10.2.1.55:8091/pools \
    --table couchbaseExportJob \
    --export-dir data_for_export

Sqoop provides many more options to the export command than are covered in this document. Run sqoop help export for a list of all options, and see the Sqoop documentation for more details about them.

Some options that may be important in your export are those that define the delimiters Sqoop uses when reading records from the Hadoop text file to export to Couchbase. The default delimiter is the comma (,). Through the sqoop command you can specify a different delimiter.
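
For example, if the records in HDFS were written with tab delimiters, Sqoop's generic --input-fields-terminated-by option tells the export job how to parse them. Again, this is a standard Sqoop option rather than a connector-specific one; the command below is a sketch:

shell> sqoop export --connect http://10.2.1.55:8091/pools \
    --table couchbaseExportJob \
    --export-dir data_for_export \
    --input-fields-terminated-by '\t'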

When the export job executes, it also generates a .java source code file that shows how the data was read. If, for instance, the job run had the argument --table couchbaseExportJob, Sqoop generates a couchbaseExportJob.java source code file.

List table

Sqoop has a tool called list-tables. Couchbase does not have a notion of tables, but we use DUMP and BACKFILL_## as values to the --table option.

Because the list-tables command serves no real purpose with the Couchbase Hadoop Connector, it is not recommended that you use this argument to Sqoop.

Import all tables

Sqoop has a tool called import-all-tables. Couchbase does not have a notion of tables.

Because the import-all-tables command serves no real purpose with the Couchbase Hadoop Connector, it is not recommended that you use this argument to Sqoop.

Limitations

While the Couchbase Hadoop Connector provides many features for importing and exporting data between Couchbase and Hadoop, there is some Sqoop functionality that it does not implement. These are the known limitations:

  • Querying: You cannot run queries on Couchbase. All tools that attempt to do this will fail with a NotSupportedException. Querying will be added to future Couchbase products designed to integrate with Hadoop.

  • list-databases tool: Even though Couchbase is a multitenant system that allows for multiple buckets (which are analogous to databases), there is no way to list these buckets from Sqoop. The list of buckets is available through the Couchbase Web Console.

  • eval-sql tool: Couchbase does not use SQL, so this tool is not appropriate.

  • The Couchbase Hadoop Connector does not automatically handle some classes of failures in a Couchbase cluster or changes to cluster topology while the Sqoop task is being run.