Migration Guide

    Helpful information when migrating from the Spark 2.x connector to the Spark 3.x connector.

    Upfront Considerations

    When migrating from version 2 of the Spark connector to version 3, the general guideline is: the lower-level the API, the more work the migration requires. Most of the changes you will likely need to make concern configuration and RDD access. There are some changes in the Spark SQL area, but not as many.

    Please keep in mind that this is a new major version of the connector, so we took the liberty of breaking APIs where it made sense to future-proof the connector and prepare it for modern Couchbase features (including scopes and collections).

    The new connector builds on two fundamentals:

    • Spark 3.x

    • Couchbase Scala SDK 1.x

    Since Spark itself is written in Scala and Couchbase now provides a native Scala SDK, the connector builds directly on top of it. As a result, for lower-level APIs (like RDDs) where SDK case classes are accepted or returned, please refer to the official SDK documentation if you are in doubt about how to use them.

    This also raises the minimum required server version, so please make sure you are running Couchbase Server 6.x or later.
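
    For illustration, here is a minimal sketch of what working with those SDK types through the connector looks like. It assumes a configured SparkSession (see below), an implicit bucket, and document IDs that exist; the exact couchbaseGet signature and the package of the Get helper are covered in the RDD section, so treat the imports as assumptions:

    import com.couchbase.spark._
    import com.couchbase.spark.kv.Get // assumed package for the Get write/read command
    import com.couchbase.client.scala.json.JsonObject

    // Fetch two documents as an RDD. The results are the Scala SDK's GetResult
    // objects, so decoding uses the SDK's contentAs mechanism (which returns a Try).
    spark.sparkContext
      .couchbaseGet(Seq(Get("airline_10"), Get("airline_10642")))
      .collect()
      .foreach(result => println(result.contentAs[JsonObject]))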

    Configuration and Connection Management

    Note that the old com.couchbase prefix is no longer supported; only the spark.couchbase prefix can be used with the connector going forward.

    The configuration properties have been modified to reflect that role-based access control (RBAC) is now required.

    The minimum configuration through Spark config properties is the following:

    Table 1. Required Config Properties
    Property                          Description
    spark.couchbase.connectionString  The connection string / hostnames of the cluster
    spark.couchbase.username          The name of your RBAC user
    spark.couchbase.password          The password of your RBAC user

    Here is an example:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("Migration")
      .config("spark.couchbase.connectionString", "127.0.0.1")
      .config("spark.couchbase.username", "Administrator")
      .config("spark.couchbase.password", "password")
      .getOrCreate()

    There are other properties, described in the configuration section, that can make the other APIs more comfortable to use (e.g. spark.couchbase.implicitBucket), but they are not required during a migration step.
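
    For example, an implicit bucket can be configured alongside the required properties so that bucket-scoped APIs do not need to name the bucket on every call (the travel-sample bucket used here is purely illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("Migration")
      .config("spark.couchbase.connectionString", "127.0.0.1")
      .config("spark.couchbase.username", "Administrator")
      .config("spark.couchbase.password", "password")
      // Optional: lets lower-level APIs fall back to this bucket
      .config("spark.couchbase.implicitBucket", "travel-sample")
      .getOrCreate()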

    Removed APIs

    Due to decreased demand, the View APIs have not been ported to the new major version of the connector. Nearly all of their functionality can be achieved (often better) by using Query, Analytics, or Full Text Search. Please convert to those APIs if necessary.
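
    As a sketch of such a conversion, a view that emitted document contents can usually be replaced by a query through the RDD API. The statement, keyspace, and row type below are illustrative assumptions; the exact couchbaseQuery signature is described in the RDD section:

    import com.couchbase.spark._
    import com.couchbase.client.scala.json.JsonObject

    // Rough replacement for a 2.x couchbaseView call that listed airline documents
    spark.sparkContext
      .couchbaseQuery[JsonObject]("SELECT META().id, a.* FROM `travel-sample`.inventory.airline a LIMIT 10")
      .collect()
      .foreach(println)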

    RDD Operations

    RDD operations on the SparkContext are still available, but their signatures have changed.

    Table 2. RDD SparkContext Functions
    2.x                    3.x
    couchbaseGet           couchbaseGet
    couchbaseSubdocLookup  couchbaseLookupIn
    couchbaseSubdocMutate  couchbaseMutateIn
    couchbaseView          (removed)
    couchbaseSpatialView   (removed)
    couchbaseQuery         couchbaseQuery
    couchbaseAnalytics     couchbaseAnalyticsQuery
    (not available)        couchbaseSearchQuery
    (not available)        couchbaseUpsert
    (not available)        couchbaseReplace
    (not available)        couchbaseInsert
    (not available)        couchbaseRemove
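
    To illustrate one of the renames above, the 2.x couchbaseAnalytics function becomes couchbaseAnalyticsQuery in 3.x and, like the other query-style functions, takes the statement plus a target row type. The statement and row type here are illustrative; see the RDD section for the exact signature:

    import com.couchbase.spark._
    import com.couchbase.client.scala.json.JsonObject

    // 2.x: sc.couchbaseAnalytics("SELECT ...")
    // 3.x: renamed, with the row type decoded through the Scala SDK
    spark.sparkContext
      .couchbaseAnalyticsQuery[JsonObject]("SELECT a.* FROM `travel-sample`.inventory.airport a LIMIT 5")
      .collect()
      .foreach(println)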

    In addition to the SparkContext, RDD APIs are also still available on the RDDs themselves:

    Table 3. RDD Functions
    2.x                                                                 3.x
    saveToCouchbase(StoreMode.UPSERT)                                   couchbaseUpsert
    saveToCouchbase(StoreMode.INSERT_AND_FAIL or INSERT_AND_IGNORE)     couchbaseInsert
    saveToCouchbase(StoreMode.REPLACE_AND_FAIL or REPLACE_AND_IGNORE)   couchbaseReplace
    (not available)                                                     couchbaseRemove
    couchbaseSubdocMutate                                               couchbaseMutateIn

    Please see the section on working with RDDs for all the different required and optional arguments that can be applied to each of those functions.
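
    As a sketch of the most common conversion from the table above, a 2.x saveToCouchbase(StoreMode.UPSERT) becomes couchbaseUpsert on an RDD of write commands. The Upsert wrapper, the JsonObject payloads, and the implicit bucket are assumptions here; the RDD section has the authoritative signatures:

    import com.couchbase.spark._
    import com.couchbase.spark.kv.Upsert // assumed package for the Upsert write command
    import com.couchbase.client.scala.json.JsonObject

    // 2.x: rdd.saveToCouchbase(StoreMode.UPSERT)
    // 3.x sketch: each document is wrapped in Upsert(id, content) and written back;
    // collect() triggers the write and returns the mutation results
    spark.sparkContext
      .parallelize(Seq(
        Upsert("doc-1", JsonObject.create.put("name", "first")),
        Upsert("doc-2", JsonObject.create.put("name", "second"))
      ))
      .couchbaseUpsert()
      .collect()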

    Spark SQL

    The main difference when working with Spark SQL is how the DataFrame is configured. First, it lives under a different path for query. In the previous version of the connector, the DataSource was located under com.couchbase.spark.sql.DefaultSource, or was available as a method if the implicit import was used.

    In the new version, the different sources (Query, Analytics and KeyValue) are automatically registered with specific names so they can be used directly:

    Accessing Query:

    spark.read.format("couchbase.query")

    Accessing Analytics:

    spark.read.format("couchbase.analytics")

    Accessing KeyValue (Write only):

    df.write.format("couchbase.kv")

    The other difference is how the DataFrame is configured. The query DataFrame is configured with properties from QueryOptions, analytics with AnalyticsOptions, and key-value with KeyValueOptions. Please see the section on Spark SQL for all the different configuration properties that are available.
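
    As a hedged sketch of both halves, here is a query read followed by a key-value write. The option constants (QueryOptions.Bucket, QueryOptions.Scope, QueryOptions.Collection and their KeyValueOptions counterparts) and their packages are assumptions; the Spark SQL section lists the exact keys:

    import com.couchbase.spark.query.QueryOptions
    import com.couchbase.spark.kv.KeyValueOptions

    // Read a DataFrame through the query service
    val airlines = spark.read
      .format("couchbase.query")
      .option(QueryOptions.Bucket, "travel-sample")
      .option(QueryOptions.Scope, "inventory")
      .option(QueryOptions.Collection, "airline")
      .load()

    // Write the rows back through the key-value service; the document ID column
    // produced by the query read is assumed to be picked up by the writer
    airlines.write
      .format("couchbase.kv")
      .option(KeyValueOptions.Bucket, "travel-sample")
      .option(KeyValueOptions.Scope, "inventory")
      .option(KeyValueOptions.Collection, "airline")
      .save()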