Migration Guide

  • concept
    +
    Helpful information when migrating from the Spark 2.x connector to the Spark 3.x connector.

    Upfront Considerations

    When migrating from the version 2 of the spark connector to version 3, the general guideline is as follows: the lower the APIs, the more work to migrate. Most of the changes you will likely need to make are concerning configuration and RDD access. There are some changes in the SparkSQL area, but not as many.

    Please keep in mind that this is a new major version of the connector, so we took the liberty to break APIs where it made sense to future-proof the connector and prepare it for modern Couchbase features (including scopes and collections).

    The new connector builds on two fundamentals:

    • Spark 3.x

    • Couchbase Scala SDK 1.x

    Since Spark itself is written in Scala and Couchbase now provides a native Scala SDK, the connector builds directly on top of it. As a result, on lower level APIs (like RDDs) where SDK case classes are accepted / returned, please refer to the official SDK documentation if you are in doubt how to use these APIs.

    This also bumps up the minimum server version required, so please make sure to run at least Couchbase Server 6.x or later.

    Configuration and Connection Management

    Note that the old com.couchbase prefix is no longer supported, only the spark.couchbase prefix can be used with the connector going forward.

    The configuration properties have been modified to reflect that role-based access control (RBAC) is now required.

    The minimum amount of configuration through spark config properties are the following:

    Table 1. Required Config Properties
    Property Description

    spark.couchbase.connectionString

    The connection string / hostnames of the cluster

    spark.couchbase.username

    The name of your RBAC user

    spark.couchbase.password

    The password of your RBAC user

    Here is an example:

    val spark = SparkSession
      .builder()
      .appName("Migration")
      .config("spark.couchbase.connectionString", "127.0.0.1")
      .config("spark.couchbase.username", "Administrator")
      .config("spark.couchbase.password", "password")
      .getOrCreate()

    There are other properties which are described in the configuration section that can help you make the other APIs more comfortable to use (i.e. spark.couchbase.implictBucket), but they are not required during a migration step.

    Removed APIs

    Due to decreased demand the View APIs have been not ported to the new major version of the connector. Nearly all functionality can be (better) achieved by using Query, Analytics or Full Text search. Please convert to those APIs if necessary.

    RDD Operations

    RDD operations on the SparkContext are still available, but have changed in their signatures.

    Table 2. RDD SparkContext Functions
    2.x 3.x

    couchbaseGet

    couchbaseGet

    couchbaseSubdocLookup

    couchbaseLookupIn

    couchbaseSubdocMutate

    couchbaseMutateIn

    couchbaseView

    (removed)

    couchbaseSpatialView

    (removed)

    couchbaseQuery

    couchbaseQuery

    couchbaseAnalytics

    couchbaseAnalyticsQuery

    (not available)

    couchbaseSearchQuery

    (not available)

    couchbaseUpsert

    (not available)

    couchbaseReplace

    (not available)

    couchbaseInsert

    (not available)

    couchbaseRemove

    In addition to the SparkContext, RDD APIs are also still available on the RDDs itself:

    Table 3. RDD Functions
    2.x 3.x

    saveToCouchbase(StoreMode.UPSERT)

    couchbaseUpsert

    saveToCouchbase(StoreMode.INSERT_AND_FAIL or INSERT_AND_IGNORE)

    couchbaseInsert

    saveToCouchbase(StoreMode.REPLACE_AND_FAIL or REPLACE_AND_IGNORE)

    couchbaseReplace

    (not available)

    couchbaseRemove

    couchbaseSubdocMutate

    couchbaseMutateIn

    Please see the section on working with RDDs on all the different required and optional arguments that can be applied to each of those functions.

    Spark SQL

    The main difference when working with Spark SQL is how to configure the DataFrame. First, it lives under a different path for query. In the previous version of the connector the DataSource has been located under com.couchbase.spark.sql.DefaultSource or as a method if the implict import has been used.

    In the new version, the different sources (Query, Analytics and KeyValue) are automatically registered with specific names so they can be used directly:

    Accessing Query:

    spark.read.format("couchbase.query")

    Accessing Analytics:

    spark.read.format("couchbase.analytics")

    Accessing KeyValue (Write only):

    df.write.format("couchbase.kv")

    The other difference is how the DataFrame is configured. The query DataFrame is configured with properties from QueryOptions, analytics with AnalyticsOptions and key value with KeyValueOptions. Please see the section on Spark SQL for all the different configuration properties that are available.