If you are reading this then you have just downloaded the Couchbase Sqoop plugin. This plugin allows you to connect to Couchbase Server 2.0 or higher or Membase Server 1.7.1+ and stream keys into HDFS or Hive for processing with Hadoop. Note that in this document we will refer to our database as Couchbase, but if you are using Membase everything will still work correctly. If you have used Sqoop before for doing imports and exports from other databases then using this plugin should be straightforward since it uses a similar command line argument structure.
The installation process for the Couchbase Sqoop plugin is simple. When you download the plugin from Cloudera you should find a set of files that need to be moved into your Sqoop installation. These files, along with a short description of why they are needed, are listed below.
couchbase-hadoop-plugin-1.0.jar — This is the jar file that contains all of the code that lets Sqoop read data from Couchbase.
couchbase-config.xml — This is a property file used to register a ManagerFactory for the Couchbase plugin with Sqoop.
couchbase-manager.xml — This property file tells Sqoop which jar the ManagerFactory defined in couchbase-config.xml resides in.
spymemcached-2.8-preview3.jar — This is the client jar used by our plugin to read and write data from Couchbase.
jettison-1.1.jar - This is a dependency of the spymemcached client jar.
netty-3.1.5GA.jar - This is a dependency of the spymemcached client jar.
install.sh — A script to automatically install the Couchbase plugin files to Sqoop.
Automatic installation is done through the use of the install.sh script that comes with the plugin download. The script takes one argument, the path to your Sqoop installation. Below is an example of how to use the script.
shell> ./install.sh path_to_sqoop_home
Manual installation of the Couchbase plugin requires copying the files downloaded from Cloudera into your Sqoop installation. Below is a list of the files contained in the plugin and the name of the directory in your Sqoop installation to copy each file to.
couchbase-hadoop-plugin-1.0.jar — lib
spymemcached-2.8.jar — lib
jettison-1.1.jar — lib
netty-3.1.5GA.jar — lib
couchbase-config.xml — conf
couchbase-manager.xml — conf/managers.d
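The manual steps above can be sketched as a small shell function. This is a minimal sketch and not the bundled install.sh: the function name install_couchbase_plugin and its two arguments (the unpacked plugin directory and the Sqoop home directory) are assumptions made for illustration, and the file names are taken from the list above.

```shell
# Sketch only: install_couchbase_plugin is a hypothetical helper,
# not part of the plugin download.
# Usage: install_couchbase_plugin <plugin_dir> <sqoop_home>
install_couchbase_plugin() (
  plugin_dir=$1
  sqoop_home=$2

  # Create the target directories in case they do not exist yet.
  mkdir -p "$sqoop_home/lib" "$sqoop_home/conf/managers.d" || exit 1

  # Jar files go into lib. Stop at the first failed copy.
  for jar in couchbase-hadoop-plugin-1.0.jar spymemcached-2.8.jar \
             jettison-1.1.jar netty-3.1.5GA.jar; do
    cp "$plugin_dir/$jar" "$sqoop_home/lib/" || exit 1
  done

  # Configuration files go into conf and conf/managers.d.
  cp "$plugin_dir/couchbase-config.xml" "$sqoop_home/conf/" || exit 1
  cp "$plugin_dir/couchbase-manager.xml" "$sqoop_home/conf/managers.d/"
)
```

Run from the unpacked plugin directory, this would be called as install_couchbase_plugin . path_to_sqoop_home.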
Uninstallation of the plugin requires removal of all of the files that were added to Sqoop during installation. To do this cd into your Sqoop home directory and execute the following command:
shell> rm lib/couchbase-hadoop-plugin-1.0.jar lib/spymemcached-2.8.jar \
    lib/jettison-1.1.jar lib/netty-3.1.5GA.jar \
    conf/couchbase-config.xml conf/managers.d/couchbase-manager.xml
The Couchbase Sqoop Plugin can be used with a variety of command line tools that are provided by Sqoop. In this section we discuss the usage of each tool.
Since Sqoop is built for a relational model, it requires that the user specify a table to import from or export into. The Couchbase plugin uses the --table option to specify the type of tap stream to use. For exports the user must enter a value for the --table option even though what is entered will not actually be used by the plugin. For imports the --table option can take only two values.
DUMP — Causes all keys currently in Couchbase to be read into HDFS.
BACKFILL_## — Streams all key mutations for a given amount of time (in minutes).
Note that in the --table value for the BACKFILL table, a time in minutes should be put in place of the ## placeholder. For example, BACKFILL_5 means stream key mutations in the Couchbase server for 5 minutes and then stop the stream.
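Because the backfill duration is given in minutes, longer periods have to be converted first. The following is a minimal sketch of that conversion; the helper name backfill_table is hypothetical and not part of the plugin.

```shell
# Sketch only: backfill_table is a hypothetical helper, not part of the plugin.
# Prints the --table value for a backfill of the given number of minutes.
backfill_table() {
  case $1 in
    ''|*[!0-9]*)
      echo "backfill_table: minutes must be a whole number" >&2
      return 1
      ;;
  esac
  printf 'BACKFILL_%s\n' "$1"
}

# One week expressed in minutes: 7 days * 24 hours * 60 minutes.
backfill_table $((7 * 24 * 60))   # prints BACKFILL_10080
backfill_table 5                  # prints BACKFILL_5
```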
For exports a value for --table is required, but any value you enter will be ignored by the Couchbase plugin.
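A connect string option is required in order to connect to Couchbase. This can be specified with --connect on the command line. A connect string for a single server looks like http://10.2.1.55:8091/pools, and a connect string for several servers is a comma-separated list of such URLs.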
When creating your connect strings simply replace the IP address above with the IP address of your Couchbase server. If you have multiple servers you can list them in a comma-separated list.
Why list multiple servers? Let’s say you create a backfill stream for 10,080 minutes or one week. In that time period you might have a server crash, have to add another server, or remove a server from your cluster. Providing an address to each server allows an import and export command to proceed through topology changes to your cluster. In the first example above if you had a two-node cluster and 10.2.1.55 goes down then the import will fail even though the entire cluster didn’t go down. If you list both machines then the import will continue unaffected by the downed server and your import will complete successfully.
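As an illustration of the comma-separated form, the sketch below builds a connect string from a list of host addresses. The helper name couchbase_connect is hypothetical, 10.2.1.56 is an assumed address for a second node, and the port and /pools path are copied from the examples in this document.

```shell
# Sketch only: couchbase_connect is a hypothetical helper, not part of the plugin.
# Joins one http://HOST:8091/pools URL per host into a comma-separated list.
couchbase_connect() (
  result=""
  for host in "$@"; do
    # Prepend a comma only when result already holds an earlier URL.
    result="${result:+$result,}http://${host}:8091/pools"
  done
  printf '%s\n' "$result"
)

# 10.2.1.56 is an assumed address for a second node in the cluster.
couchbase_connect 10.2.1.55 10.2.1.56
# prints http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools
```

The result can then be passed to Sqoop, for example as --connect "$(couchbase_connect 10.2.1.55 10.2.1.56)".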
By default the Couchbase plugin connects to the default bucket. If you want to connect to a bucket other than the default bucket you can specify the bucket name with the --username option. If you have to connect to a SASL bucket use the --password option followed by the bucket's password.
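A command line for a non-default SASL bucket might be assembled as in the sketch below. The bucket name and password are hypothetical, and the helper bucket_import_command only prints the command instead of running it, so the sketch can be tried on a machine without Sqoop.

```shell
# Sketch only: the bucket name and password below are hypothetical.
BUCKET="mybucket"
BUCKET_PASSWORD="mypassword"

# Print (rather than run) an import command for a named bucket.
# Arguments: connect string, --table value, bucket name, bucket password.
bucket_import_command() {
  printf 'bin/sqoop import --connect %s --table %s --username %s --password %s\n' \
    "$1" "$2" "$3" "$4"
}

bucket_import_command http://10.2.1.55:8091/pools DUMP "$BUCKET" "$BUCKET_PASSWORD"
```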
Importing data to your cluster requires the use of the Sqoop import command followed by the parameters --connect and --table. Below are some example imports.
shell> bin/sqoop import --connect http://10.2.1.55:8091/pools --table DUMP
This will dump all key-value pairs from Couchbase into HDFS.
shell> bin/sqoop import --connect http://10.2.1.55:8091/pools --table BACKFILL_10
This will stream all key-value mutations from Couchbase into HDFS for 10 minutes and then stop.
Sqoop provides many more options to the import command than we will cover in this document. Run bin/sqoop help import for a list of all options and see the Sqoop documentation for more details about these options.
Exporting data to your cluster requires the use of the Sqoop export command followed by the parameters --connect, --export-dir, and --table. Below are some example exports.
shell> bin/sqoop export --connect http://10.2.1.55:8091/pools --table garbage_value --export-dir dump_4-12-11
This will export all key-value pairs from the HDFS directory specified by --export-dir into Couchbase.
shell> bin/sqoop export --connect http://10.2.1.55:8091/pools --table garbage_value --export-dir backfill_4-29-11
This will export all key-value pairs from the HDFS directory specified by --export-dir into Couchbase.
Sqoop provides many more options to the export command than we will cover in this document. Run bin/sqoop help export for a list of all options and see the Sqoop documentation for more details about these options.
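The import and export commands can be combined into a round trip. The sketch below is a dry run that prints the two commands unless DRY_RUN is set to something other than 1. The wrapper run_or_print and the directory name couchbase_dump are hypothetical. --target-dir is a standard Sqoop import option, but that the Couchbase plugin honors it is an assumption to check against your installation.

```shell
# Sketch only: run_or_print and the directory name are hypothetical.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run_or_print() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

CONNECT="http://10.2.1.55:8091/pools"
DUMP_DIR="couchbase_dump"   # hypothetical HDFS directory

# Import every key currently in Couchbase into DUMP_DIR.
# Assumption: the plugin honors Sqoop's --target-dir option.
run_or_print bin/sqoop import --connect "$CONNECT" --table DUMP --target-dir "$DUMP_DIR"

# Export the same directory back. The --table value is required but ignored.
run_or_print bin/sqoop export --connect "$CONNECT" --table unused --export-dir "$DUMP_DIR"
```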
Sqoop has a tool called list-tables that in a relational database has a lot of meaning since it shows us what kinds of things we can import. As noted in previous sections, Couchbase doesn't have a notion of tables, but we use DUMP and BACKFILL_## as values to the --table option. As a result using the list-tables tool does the following.
shell> bin/sqoop list-tables --connect http://10.2.1.55:8091/pools
DUMP
BACKFILL_5
All this does in the case of the Couchbase plugin is remind us what we can use as an argument to the --table option. We give BACKFILL a time of 5 minutes so that the import-all-tables tool functions properly.
Sqoop provides many more options to the list-tables command than we will cover in this document. Run bin/sqoop help list-tables for a list of all options and see the Sqoop documentation for more details about these options.
In the Couchbase plugin the import-all-tables tool dumps all keys in Couchbase into HDFS and then streams all key-value mutations into Hadoop for five minutes. This is equivalent to running import on each table returned by the list-tables tool. Below is an example of this command.
shell> bin/sqoop import-all-tables --connect http://10.2.1.55:8091/pools
Sqoop provides many more options to the import-all-tables command than we will cover in this document. Run bin/sqoop help import-all-tables for a list of all options and see the Sqoop documentation for more details about these options.
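To make the relationship between list-tables and import-all-tables concrete, the sketch below prints the individual import commands that import-all-tables corresponds to, one per table name reported by list-tables. The helper name import_all_commands is hypothetical, and the commands are only printed, not run.

```shell
# Sketch only: import_all_commands is a hypothetical helper.
# Prints one import command per table name that list-tables reports
# for the Couchbase plugin (DUMP and BACKFILL_5).
import_all_commands() {
  for table in DUMP BACKFILL_5; do
    printf 'bin/sqoop import --connect %s --table %s\n' "$1" "$table"
  done
}

import_all_commands http://10.2.1.55:8091/pools
```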
While Couchbase provides many great features to import and export data from Couchbase to Hadoop there is some functionality that the plugin doesn’t implement in Sqoop. Here’s a list of what isn’t implemented.
Querying: You cannot run queries on Couchbase. All tools that attempt to do this will fail with a NotSupportedException.
list-databases tool: Even though Couchbase is a multi-tenant system that allows for multiple databases, there is no way of listing these databases from Sqoop.
eval-sql tool: Couchbase doesn’t use SQL so this tool will not work.
The Couchbase plugin consists of two parts. The first part is the addition of code that allows the mappers in Hadoop to read the values sent to it from Couchbase. The second part is the use of the Spymemcached client to get data to and from Couchbase. For imports the plugin uses the tap stream feature in Spymemcached. Tap streams allow users to stream large volumes of data from Couchbase into other applications and are also at the heart of replication in Couchbase. They enable a fast way to move data from Couchbase to Hadoop for further processing. Getting data back into Couchbase runs through the front end of Couchbase using the memcached protocol.
For more information about the internals of Sqoop see the Sqoop documentation.
This documentation and associated software is subject to the following licenses.
This documentation in any form, software or printed matter, contains proprietary information that is the exclusive property of Couchbase. Your access to and use of this material is subject to the terms and conditions of your Couchbase Software License and Service Agreement, which has been executed and with which you agree to comply. This document and information contained herein may not be disclosed, copied, reproduced, or distributed to anyone outside Couchbase without prior written consent of Couchbase or as specifically provided below. This document is not part of your license agreement nor can it be incorporated into any contractual agreement with Couchbase or its subsidiaries or affiliates.
Use of this documentation is subject to the following terms:
You may create a printed copy of this documentation solely for your own personal use. Conversion to other formats is allowed as long as the actual content is not altered or edited in any way. You shall not publish or distribute this documentation in any form or on any media, except if you distribute the documentation in a manner similar to how Couchbase disseminates it (that is, electronically for download on a Web site with the software) or on a CD-ROM or similar medium, provided however that the documentation is disseminated together with the software on the same medium. Any other use, such as any dissemination of printed copies or use of this documentation, in whole or in part, in another publication, requires the prior written consent from an authorized representative of Couchbase. Couchbase and/or its affiliates reserve any and all rights to this documentation not expressly granted above.
This documentation may provide access to or information on content, products, and services from third parties. Couchbase Inc. and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Couchbase Inc. and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.
The information contained herein is subject to change without notice and is not warranted to be error free. If you find any errors, please report them to us in writing.