CouchSqoop: A Couchbase Plugin for Sqoop

Introduction

If you are reading this then you have just downloaded the Couchbase Sqoop plugin. This plugin allows you to connect to Couchbase Server 2.0 or higher or Membase Server 1.7.1+ and stream keys into HDFS or Hive for processing with Hadoop. Note that in this document we will refer to our database as Couchbase, but if you are using Membase everything will still work correctly. If you have used Sqoop before for doing imports and exports from other databases then using this plugin should be straightforward since it uses a similar command line argument structure.

Installation

The installation process for the Couchbase Sqoop plugin is simple. When you download the plugin from Cloudera you should find a set of files that need to be moved into your Sqoop installation. These files, along with a short description of why each is needed, are listed below.

  • couchbase-hadoop-plugin-1.0.jar — This is the jar file that contains all of the compiled code that lets Sqoop read data from Couchbase.

  • couchbase-config.xml — This is a property file used to register a ManagerFactory for the Couchbase plugin with Sqoop.

  • couchbase-manager.xml — This property file tells Sqoop which jar contains the ManagerFactory defined in couchbase-config.xml.

  • spymemcached-2.8-preview3.jar — This is the client jar used by our plugin to read and write data from Couchbase.

  • jettison-1.1.jar - This is a dependency of the Spymemcached client jar.

  • netty-3.1.5GA.jar - This is a dependency of the Spymemcached client jar.

  • install.sh — A script to automatically install the Couchbase plugin files to Sqoop.

Automatic Installation

Automatic installation is done through the use of the install.sh script that comes with the plugin download. The script takes one argument, the path to your Sqoop installation. Below is an example of how to use the script.

shell> ./install.sh path_to_sqoop_home

Manual Installation

Manual installation of the Couchbase plugin requires copying the files downloaded from Cloudera into your Sqoop installation. Below is a list of the files contained in the plugin and the name of the directory in your Sqoop installation to copy each file to.

  • couchbase-hadoop-plugin-1.0.jar — lib

  • spymemcached-2.8.jar — lib

  • jettison-1.1.jar — lib

  • netty-3.1.5GA.jar — lib

  • couchbase-config.xml — conf

  • couchbase-manager.xml — conf/managers.d
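The copy steps in the list above can be sketched as a small shell function. This is only an illustration of the manual procedure, not the contents of the bundled install.sh. The function name install_couchbase_plugin and both of its arguments are assumptions made for this example.

```shell
#!/bin/sh
# Sketch of the manual installation steps listed above.
# install_couchbase_plugin is a hypothetical helper, not part of the plugin.
# Usage: install_couchbase_plugin <plugin_dir> <sqoop_home>
install_couchbase_plugin() {
    plugin_dir=$1
    sqoop_home=$2

    # Create the target directories if they are missing.
    mkdir -p "$sqoop_home/lib" "$sqoop_home/conf/managers.d" || return 1

    # The four jar files go into lib.
    for jar in couchbase-hadoop-plugin-1.0.jar spymemcached-2.8.jar \
               jettison-1.1.jar netty-3.1.5GA.jar; do
        cp "$plugin_dir/$jar" "$sqoop_home/lib/" || return 1
    done

    # The two configuration files go into conf and conf/managers.d.
    cp "$plugin_dir/couchbase-config.xml" "$sqoop_home/conf/" || return 1
    cp "$plugin_dir/couchbase-manager.xml" "$sqoop_home/conf/managers.d/" || return 1
}
```

After defining the function, it could be called as install_couchbase_plugin path_to_plugin_files path_to_sqoop_home. It stops at the first file that cannot be copied.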

Uninstall

Uninstallation of the plugin requires removal of all of the files that were added to Sqoop during installation. To do this cd into your Sqoop home directory and execute the following command:

shell> rm lib/couchbase-hadoop-plugin-1.0.jar lib/spymemcached-2.8.jar \
    lib/jettison-1.1.jar lib/netty-3.1.5GA.jar \
    conf/couchbase-config.xml conf/managers.d/couchbase-manager.xml

Using Sqoop

The Couchbase Sqoop Plugin can be used with a variety of command line tools that are provided by Sqoop. In this section we discuss the usage of each tool.

Tables

Since Sqoop is built for a relational model, it requires the user to specify a table for every import and export. The Couchbase plugin uses the --table option to specify the type of tap stream to use when importing from Couchbase. For imports the --table option can take only two values.

  • DUMP — Causes all keys currently in Couchbase to be read into HDFS.

  • BACKFILL_## — Streams all key mutations for a given amount of time (in minutes).

For the BACKFILL table, replace ## with the length of the stream in minutes. For example, BACKFILL_5 means stream key mutations in the Couchbase server for 5 minutes and then stop the stream.
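Because the BACKFILL value is given in minutes, longer streams need a little arithmetic. The sketch below converts days, hours, and minutes into a table name. The helper name backfill_table is an assumption made for this example and is not part of the plugin.

```shell
# backfill_table is a hypothetical helper, not part of the plugin.
# It converts a duration into a BACKFILL_## table name (## is in minutes).
# Usage: backfill_table <days> <hours> <minutes>
backfill_table() {
    days=${1:-0}
    hours=${2:-0}
    minutes=${3:-0}
    # 1 day = 24 * 60 minutes, 1 hour = 60 minutes.
    total=$(( days * 24 * 60 + hours * 60 + minutes ))
    echo "BACKFILL_$total"
}
```

For example, backfill_table 7 0 0 prints BACKFILL_10080, a stream that runs for one week.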

For exports a value for --table is required, but whatever value you supply is ignored by the Couchbase plugin.

Connect String

A connect string option is required in order to connect to Couchbase. This can be specified with --connect on the command line. Below are two examples of connect strings.

http://10.2.1.55:8091/pools
http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools

When creating your connect strings, simply replace the IP address above with the IP address of your Couchbase server. If you have multiple servers, you can list them in a comma-separated list.

Why list multiple servers? Let’s say you create a backfill stream for 10,080 minutes, or one week. In that time period you might have a server crash, have to add another server, or remove a server from your cluster. Providing an address for each server allows an import or export command to proceed through topology changes to your cluster. In the first example above, if you had a two-node cluster and 10.2.1.55 went down, the import would fail even though the entire cluster didn’t go down. If you list both machines, the import continues unaffected by the downed server and completes successfully.
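A multi-server connect string can be assembled from a list of node addresses. The sketch below assumes every node uses port 8091 and the /pools path, as in the examples above. The helper name connect_string is an assumption made for this example.

```shell
# connect_string is a hypothetical helper, not part of the plugin.
# It builds a comma-separated --connect value from a list of host addresses,
# assuming port 8091 and the /pools path on every node.
# Usage: connect_string <host> [<host> ...]
connect_string() {
    result=""
    for host in "$@"; do
        url="http://$host:8091/pools"
        if [ -z "$result" ]; then
            result=$url
        else
            result="$result,$url"
        fi
    done
    echo "$result"
}
```

The result can be passed straight to Sqoop, for example bin/sqoop import --connect "$(connect_string 10.2.1.55 10.2.1.56)" --table BACKFILL_10080.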

Connecting to Different Buckets

By default the Couchbase plugin connects to the default bucket. If you want to connect to a bucket other than the default bucket, you can specify the bucket name with the --username option. If you have to connect to a SASL bucket, use the --password option followed by the bucket’s password.
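As an illustration of the bucket options, the sketch below prints an import command for a named bucket and adds --password only when a password is supplied. The helper name bucket_import_command and the bucket names used with it are placeholders invented for this example. The helper only prints the command so it can be reviewed before being run by hand.

```shell
# bucket_import_command is a hypothetical helper, not part of the plugin.
# It prints (does not run) an import command for the given bucket.
# Usage: bucket_import_command <bucket> [password]
bucket_import_command() {
    bucket=$1
    password=$2
    cmd="bin/sqoop import --connect http://10.2.1.55:8091/pools --table DUMP --username $bucket"
    # Only SASL buckets need the --password option.
    if [ -n "$password" ]; then
        cmd="$cmd --password $password"
    fi
    echo "$cmd"
}
```

Calling bucket_import_command my_bucket prints a command for a bucket without a password, and bucket_import_command my_sasl_bucket my_password prints one for a SASL bucket.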

Importing

Importing data to your cluster requires the use of the Sqoop import command followed by the parameters --connect and --table. Below are some example imports.

shell> bin/sqoop import --connect http://10.2.1.55:8091/pools --table DUMP

This will dump all key-value pairs from Couchbase into HDFS.

shell> bin/sqoop import --connect http://10.2.1.55:8091/pools --table BACKFILL_10

This will stream all key-value mutations from Couchbase into HDFS for 10 minutes.

Sqoop provides many more options to the import command than we will cover in this document. Run bin/sqoop help import for a list of all options, and see the Sqoop documentation for more details about these options.

Exporting

Exporting data to your cluster requires the use of the Sqoop export command followed by the parameters --connect, --export-dir, and --table. Below are some example exports.

shell> bin/sqoop export --connect http://10.2.1.55:8091/pools --table garbage_value --export-dir dump_4-12-11

This will export all key-value pairs from the HDFS directory specified by --export-dir into Couchbase.

shell> bin/sqoop export --connect http://10.2.1.55:8091/pools --table garbage_value --export-dir backfill_4-29-11

This will export all key-value pairs from the HDFS directory specified by --export-dir into Couchbase.

Sqoop provides many more options to the export command than we will cover in this document. Run bin/sqoop help export for a list of all options, and see the Sqoop documentation for more details about these options.

List Tables

Sqoop has a tool called list-tables, which is meaningful for a relational database because it shows what can be imported. As noted in previous sections, Couchbase doesn’t have a notion of tables, but we use DUMP and BACKFILL_## as values to the --table option. As a result, using the list-tables tool does the following.

shell> bin/sqoop list-tables --connect http://10.2.1.55:8091/pools
DUMP
BACKFILL_5

All this does in the case of the Couchbase plugin is remind us what we can use as an argument to the --table option. We give BACKFILL a time of 5 minutes so that the import-all-tables tool functions properly.

Sqoop provides many more options to the list-tables command than we will cover in this document. Run bin/sqoop help list-tables for a list of all options, and see the Sqoop documentation for more details about these options.

Import All Tables

In the Couchbase plugin the import-all-tables tool dumps all keys in Couchbase into HDFS and then streams all key-value mutations into Hadoop for five minutes. This command is a direct result of running import on each table from the list-tables command. Below is an example of this command.

shell> bin/sqoop import-all-tables --connect http://10.2.1.55:8091/pools
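Since import-all-tables is described as running import on each table reported by list-tables, the same effect can be approximated with one import per table. The sketch below only prints those commands so they can be reviewed first. The helper name import_each_table is an assumption made for this example.

```shell
# import_each_table is a hypothetical helper, not part of the plugin.
# It prints (does not run) one import command per table name given.
# Usage: import_each_table <connect_string> <table> [<table> ...]
import_each_table() {
    connect=$1
    shift
    for table in "$@"; do
        echo "bin/sqoop import --connect $connect --table $table"
    done
}
```

Running import_each_table http://10.2.1.55:8091/pools DUMP BACKFILL_5 prints the two imports that correspond to the tables listed by list-tables. Passing a different BACKFILL value gives a longer or shorter stream than the five minutes used by import-all-tables.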

Sqoop provides many more options to the import-all-tables command than we will cover in this document. Run bin/sqoop help import-all-tables for a list of all options, and see the Sqoop documentation for more details about these options.

Limitations

While the Couchbase plugin provides many features for importing and exporting data between Couchbase and Hadoop, there is some Sqoop functionality that the plugin doesn’t implement. Here’s a list of what isn’t implemented.

  • Querying: You cannot run queries on Couchbase. All tools that attempt to do this will fail with a NotSupportedException.

  • list-databases tool: Even though Couchbase is a multi-tenant system that allows for multiple databases, there is no way of listing these databases from Sqoop.

  • eval-sql tool: Couchbase doesn’t use SQL so this tool will not work.

Internals

The Couchbase plugin consists of two parts. The first part is the addition of code that allows the mappers in Hadoop to read the values sent to them from Couchbase. The second part is the use of the Spymemcached client to get data to and from Couchbase. For imports the plugin uses the tap stream feature in Spymemcached. Tap streams allow users to stream large volumes of data from Couchbase into other applications and are also at the heart of replication in Couchbase. They enable a fast way to move data from Couchbase to Hadoop for further processing. Data going back into Couchbase is written through the front end of Couchbase using the memcached protocol.

For more information about the internals of Sqoop see the Sqoop documentation.

Appendix: Licenses

This documentation and associated software is subject to the following licenses.

Documentation License

This documentation in any form, software or printed matter, contains proprietary information that is the exclusive property of Couchbase. Your access to and use of this material is subject to the terms and conditions of your Couchbase Software License and Service Agreement, which has been executed and with which you agree to comply. This document and information contained herein may not be disclosed, copied, reproduced, or distributed to anyone outside Couchbase without prior written consent of Couchbase or as specifically provided below. This document is not part of your license agreement nor can it be incorporated into any contractual agreement with Couchbase or its subsidiaries or affiliates.

Use of this documentation is subject to the following terms:

You may create a printed copy of this documentation solely for your own personal use. Conversion to other formats is allowed as long as the actual content is not altered or edited in any way. You shall not publish or distribute this documentation in any form or on any media, except if you distribute the documentation in a manner similar to how Couchbase disseminates it (that is, electronically for download on a Web site with the software) or on a CD-ROM or similar medium, provided however that the documentation is disseminated together with the software on the same medium. Any other use, such as any dissemination of printed copies or use of this documentation, in whole or in part, in another publication, requires the prior written consent from an authorized representative of Couchbase. Couchbase and/or its affiliates reserve any and all rights to this documentation not expressly granted above.

This documentation may provide access to or information on content, products, and services from third parties. Couchbase Inc. and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Couchbase Inc. and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.

The information contained herein is subject to change without notice and is not warranted to be error free. If you find any errors, please report them to us in writing.