Using the Spark Shell

concept

The interactive shell can be used together with the couchbase connector for quick and easy data exploration.

Getting Started

The Spark shell provides an easy and convenient way to prototype certain operations quickly,without having to develop a full program, packaging it and then deploying it.

You need to download Apache Spark from the website, then navigate into the bin directory and run the spark-shell command:

Downloads/spark-3.2.0-bin-hadoop2.7/bin
❯ ./spark-shell -h
Usage: ./bin/spark-shell [options]

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
...

If you run the Spark shell as it is, you will only have the built-in Spark commands available. If you want to use it with the Couchbase Connector, the easiest way is to provide a specific argument that locates the dependency and pulls it in:

./spark-shell --packages com.couchbase.client:spark-connector_2.12:3.2.0

The final step that needs to be undertaken is to specify all required properties (connectionString, username and password) so that the connector can bootstrap:

./spark-shell --packages com.couchbase.client:spark-connector_2.12:3.2.0 -c spark.couchbase.connectionString=127.0.0.1 -c spark.couchbase.username=user -c spark.couchbase.password=pass

Now you’re all set!

Usage

Once you’ve loaded the shell, both the SparkContext (sc) and the surrounding SparkSession are ready to go:

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2703fabd

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4e7fbd6c

To load the Couchbase-specific implicit imports, run the following command:

scala> import com.couchbase.spark._
import com.couchbase.spark._

Now you can run all commands like in a regular program, just in an interactive fashion. The following example retrieves a document through the KeyValue API:

scala> import com.couchbase.spark.kv.Get
import com.couchbase.spark.kv.Get

scala> import com.couchbase.client.scala.json.JsonObject
import com.couchbase.client.scala.json.JsonObject

scala> sc.couchbaseGet(Seq(Get("airline_10")), Keyspace(bucket = Some("travel-sample"))).collect().foreach(result => println(result.contentAs[JsonObject]))
Success({"country":"United States","iata":"Q5","name":"40-Mile Air","callsign":"MILE-AIR","icao":"MLA","id":10,"type":"airline"})

You can also make use of the first-class query integration. The following example creates a data frame for airlines travel-sample bucket.

scala> import com.couchbase.spark.query._
import com.couchbase.spark.query._

scala>     val airlines = spark.read.format("couchbase.query").option(QueryOptions.Filter, "type = 'airline'").option(QueryOptions.Bucket, "travel-sample").load()
airlines: org.apache.spark.sql.DataFrame = [__META_ID: string, callsign: string ... 6 more fields]

Now you can print the schema and run ad-hoc data exploration:

scala> airlines.printSchema
root
 |-- __META_ID: string (nullable = true)
 |-- callsign: string (nullable = true)
 |-- country: string (nullable = true)
 |-- iata: string (nullable = true)
 |-- icao: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)

scala> airlines.show(5)
+-------------+--------+--------------+----+----+-----+-------------------+-------+
|    __META_ID|callsign|       country|iata|icao|   id|               name|   type|
+-------------+--------+--------------+----+----+-----+-------------------+-------+
|   airline_10|MILE-AIR| United States|  Q5| MLA|   10|        40-Mile Air|airline|
|airline_10123|     TXW| United States|  TQ| TXW|10123|        Texas Wings|airline|
|airline_10226|  atifly| United States|  A1| A1F|10226|             Atifly|airline|
|airline_10642|    null|United Kingdom|null| JRB|10642|Jc royal.britannica|airline|
|airline_10748|  LOCAIR| United States|  ZQ| LOC|10748|             Locair|airline|
+-------------+--------+--------------+----+----+-----+-------------------+-------+
only showing top 5 rows