Analytics

  • how-to
Parallel data management for complex queries over many records, using a familiar SQL-like syntax.

This page covers using our operational Ruby SDK to connect to the Analytics Service of a Capella Operational or self-managed Couchbase Server cluster. As well as this row-based analytics service, a speedy, column-based analytics database is available for real-time analytics.

Analytics SDKs

SDKs for Enterprise Analytics — Couchbase’s analytical database for real time apps and operational intelligence (RT-OLAP) — are available for the .NET, Go, Java, Node.js, and Python platforms. See the Enterprise Analytics SDK pages for more information.

Currently, different SDKs are needed to connect to Capella Analytics — as this service does not have Enterprise Analytics' load balancer, and uses a different connection protocol. Capella Analytics SDKs (also known as Columnar SDKs) are available for the Go, Java, Node.js, and Python platforms. See the Capella Analytics SDK pages for more information.

For complex and long-running queries, involving large ad hoc join, set, aggregation, and grouping operations, Couchbase Data Platform offers the Couchbase Analytics Service (CBAS). This is the analytic counterpart to our operational data focussed Query Service. The analytics service is available in Couchbase Data Platform 6.0 and later (developer preview in 5.5).

Getting Started

After familiarizing yourself with our introductory primer, in particular creating a dataset and linking it to a bucket to shadow the operational data, try Couchbase Analytics using the Ruby SDK. Intentionally, the API for analytics is very similar to that of the query service. In these examples we will be using an airports dataset created on the travel-sample bucket.

result = cluster.analytics_query('SELECT "hello" AS greeting')
result.rows.each do |row|
  puts row
  #=> {"greeting"=>"hello"}
end
puts "Reported execution time: #{result.meta_data.metrics.execution_time}"
#=> Reported execution time: 14.392402ms

Queries

A query can either be simple or be parameterized. If parameters are used, they can either be positional or named:

Positional parameters
options = Cluster::AnalyticsOptions.new
options.positional_parameters(["France"])
result = cluster.analytics_query(
  'SELECT COUNT(*) FROM airports WHERE country = ?',
  options)
Named parameters
options = Cluster::AnalyticsOptions.new
options.named_parameters("country" => "France")
result = cluster.analytics_query(
  'SELECT COUNT(*) FROM airports WHERE country = $country',
  options)
As timeouts are propagated to the server by the client, a timeout set on the client side may be used to stop the processing of a request, in order to save system resources.

Options

Additional parameters may be sent as part of the query.

Table 1. AnalyticsOptions
Name Description

String #client_context_id

Provides a custom client context ID for this query; default is a random UUID.

Boolean #priority

Allows certain requests to have higher priority than others.

Boolean #readonly

Allows explicitly marking a query as being read-only, and not mutating any documents on the server side.

Symbol #scan_consistency

Specifies level of consistency for the query — :not_bounded, :request_plus.

Integer #scan_wait

The maximum duration (in milliseconds) the query engine is willing to wait before failing.

Integer #timeout

Timeout in milliseconds.

JsonTranscoder #transcoder

Transcoder to use on rows.

Here, we set a client_context_id:

options = Cluster::AnalyticsOptions.new
options.client_context_id = "user-44-#{rand}"
result = cluster.analytics_query(
  'SELECT * FROM airports WHERE country = "France" LIMIT 10',
  options)
puts result.meta_data.client_context_id
#=> user-44-0.9295598007016517

And here we set high priority for the query:

options = Cluster::AnalyticsOptions.new
options.priority = true
result = cluster.analytics_query(
  'SELECT * FROM airports WHERE country = "France" LIMIT 10',
  options)

Here we pass readonly to explicitly mark a query as being read only, and not mutating any documents on the server side.

options = Cluster::AnalyticsOptions.new
options.readonly = true
result = cluster.analytics_query(
  'SELECT * FROM airports WHERE country = "France" LIMIT 10',
  options)

Handling the Response

The analytics query result may contain various sorts of data and metadata, depending upon the nature of the query, as you will have seen when working through our introductory primer.

Errors caused by resource unavailability (such as timeouts and Operation cannot be performed during rebalance messages) leading to an automatic retry by the SDK.

MetaData

The metadata object contains useful metadata, such as Metrics and ClientContextID.

result = cluster.analytics_query("SELECT 1=1")
puts "Execution time: #{result.meta_data.metrics.execution_time}"

Scan Consistency

Like the Couchbase Query Service, and Search, Analytics allows :request_plus queries — ensuring results contain information from updated indexes:

#options = Cluster::AnalyticsOptions.new
#options.scan_consistency = :request_plus
#result = cluster.analytics_query(
  #'SELECT * FROM airports WHERE country = "France" LIMIT 10',
  #options)

Scoped Queries on Named Collections

Given a dataset created against a collection, for example:

CREATE DATASET `airports-collection` ON `travel-sample`.inventory.airport;

You can run a query as follows:

result = cluster.analytics_query('SELECT airportname, country FROM `travel-sample`.inventory.airport WHERE country="France" LIMIT 3')

In addition to running a query via the Cluster object, you can run one via the Scope object.

bucket = cluster.bucket("travel-sample")
scope = bucket.scope("inventory")
result = scope.analytics_query('SELECT airportname, country FROM airport WHERE country="France" LIMIT 3')