Analytics using the SDK

    Parallel data management for complex queries over many records, using a familiar N1QL-like syntax.

    For complex and long-running queries involving large ad hoc join, set, aggregation, and grouping operations, the Couchbase Data Platform offers the Couchbase Analytics Service (CBAS). This is the analytic counterpart to our Query Service, which focuses on operational data. The Analytics Service is available in Couchbase Data Platform 6.0 and later.

    Getting Started

    After familiarizing yourself with our introductory primer, in particular creating a dataset and linking it to a bucket, try Couchbase Analytics using the Python SDK. Intentionally, the API for analytics is nearly identical to that of the query service.

    Before starting, here are all the imports used in the following examples:

    
    from couchbase.cluster import Cluster, ClusterOptions
    from couchbase.exceptions import CouchbaseException
    from couchbase.cluster import AnalyticsOptions, PasswordAuthenticator

    Here’s a complete example of doing an analytics query and handling the results:

    cluster = Cluster.connect("localhost", ClusterOptions(PasswordAuthenticator("Administrator", "password")))
    try:
        result = cluster.analytics_query("select \"hello\" as greeting")
    
        # Each row is a decoded JSON value, so format it instead of concatenating
        for row in result.rows():
            print("Found row: {}".format(row))
    
        print("Reported execution time: {}".format(
            result.metadata().metrics().execution_time()))
    except CouchbaseException as ex:
        import traceback
        traceback.print_exc()

    Let’s break it down. An analytics query is always performed at the Cluster level, using the analytics_query method. It takes the statement as a required argument and lets you provide additional options if needed (in the example above, no options are specified).

    Once a result returns you can iterate over the returned rows.

    If something goes wrong during the execution of the query, a subclass of CouchbaseException is raised, which also provides additional context on the operation.
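
    The query, iterate, and handle-errors flow described above can be wrapped in a small helper. The following is only a sketch: collect_rows and its error_type parameter are hypothetical names, not part of the SDK. It relies solely on the analytics_query method and the rows() iterator shown earlier, and accepts any object with that interface.

```python
import logging

logger = logging.getLogger(__name__)


def collect_rows(cluster, statement, error_type=Exception):
    """Run an analytics statement and return all of its rows as a list.

    Hypothetical helper, not part of the SDK. `cluster` is assumed to expose
    analytics_query(statement), whose result offers a rows() iterator.
    """
    try:
        result = cluster.analytics_query(statement)
        # Materialize the iterator so the caller can count or index the rows.
        return list(result.rows())
    except error_type:
        # Log the failing statement, then let the caller decide what to do.
        logger.exception("Analytics query failed: %s", statement)
        raise
```

    In application code, error_type would typically be CouchbaseException, so that unrelated programming errors are not logged as query failures.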

    Open Buckets and Cluster-Level Queries

    If you are using a cluster older than Couchbase Server 6.5, at least one bucket must be open before performing a cluster-level query. If none is open, the SDK raises a FeatureNotAvailableException with a descriptive error message asking you to open one.

    Parameterized Queries

    Supplying parameters as individual arguments to the query allows the analytics engine to optimize the parsing and planning of the query. You can either supply these parameters by name or by position.

    The first example shows how to provide them by name:

    result = cluster.analytics_query(
        "select count(*) from airports where country = $country",
        country="France")

    The second example shows how to provide them by position:

    result = cluster.analytics_query(
        "select count(*) from airports where country = ?",
        "France")

    Which style you choose is up to you; for readability in more complex queries, we generally recommend using named parameters.

    Note that you cannot use parameters in all positions. If you put one in an unsupported place, the server will respond with a ParsingFailureException.
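
    Named parameters also keep caller-supplied values out of the statement text. The sketch below wraps the by-name example in a function; count_airports and the total alias are illustrative names introduced here, and the airports dataset is the one assumed by the examples above.

```python
def count_airports(cluster, country):
    """Return the number of airports in `country`, or None if no row comes back.

    Hypothetical helper. The value travels as the named parameter $country,
    so it is never concatenated into the statement itself.
    """
    statement = "select count(*) as total from airports where country = $country"
    result = cluster.analytics_query(statement, country=country)
    for row in result.rows():
        # The aggregate is expected to produce a single row; read its alias.
        return row["total"]
    return None
```

    Because the statement text stays constant across calls, only the parameter value changes from one request to the next.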

    The Analytics Result

    When performing an analytics query, the response you receive is an AnalyticsResult. If no exception is raised, the request succeeded, and the result provides access to both the returned rows and the associated AnalyticsMetaData.

    Rows can be consumed directly, or via a serializer.

    result = cluster.analytics_query(
        "select * from `travel-sample` limit 10"
    )
    for row in result.rows():
        # Rows are decoded JSON values, so format them instead of concatenating
        print("Found row: {}".format(row))
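
    When rows are consumed directly, it can be convenient to map each one onto a small application type. The sketch below is one way to do that, under two assumptions that should be checked against your own data: that a "select *" row nests the document under the dataset name, and that documents carry type and id fields. DocSummary and summarize are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DocSummary:
    doc_type: Optional[str]
    doc_id: Optional[Any]


def summarize(row: Dict[str, Any], dataset: str = "travel-sample") -> DocSummary:
    """Build a DocSummary from one result row (hypothetical helper)."""
    # Assumption: "select *" wraps each document under the dataset name.
    # Fall back to the row itself when that key is absent.
    doc = row.get(dataset, row)
    # Assumption: documents expose "type" and "id"; missing fields become None.
    return DocSummary(doc_type=doc.get("type"), doc_id=doc.get("id"))
```

    It could be applied while iterating, for example summaries = [summarize(row) for row in result.rows()].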

    Analytics Options

    The analytics service provides an array of options to customize your query. The following table lists them all:

    Table 1. Available Analytics Options

    Name                                  Description
    client_context_id: str                Sets a context ID returned by the service for debugging purposes.
    positional_parameters: Iterable[str]  Sets positional arguments for a parameterized query.
    named_parameters: Dict[str, str]      Sets named arguments for a parameterized query.
    priority: bool                        Assigns a different server-side priority to the query.
    raw: Dict[str, Any]                   Escape hatch to add arguments that are not covered by these options.
    read_only: bool                       Tells the client and server that this query is read-only.

    Client Context Id

    The SDK always sends a client context ID with each query, even if none is provided by the user. By default, a UUID is generated, which the analytics engine mirrors back and which can be used for debugging purposes. You can provide a custom string instead if you want to introduce application-specific semantics into it (so that, for example, it shows up in a network dump with a certain identifier). Whatever you choose, we recommend making sure it is unique so that different queries can be distinguished during debugging or monitoring.

    import uuid
    result = cluster.analytics_query(
        "select ...",
        AnalyticsOptions(client_context_id="user-44{}".format(uuid.uuid4())))
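
    To keep context IDs both recognizable and unique across an application, the construction can be factored into a helper. This is a minimal sketch: make_context_id is a hypothetical name, and the prefix-plus-UUID layout is just one possible convention.

```python
import uuid


def make_context_id(prefix):
    """Return an ID such as "user-44-<uuid4>" (hypothetical helper).

    The prefix carries the application-specific meaning, and the random
    UUID keeps each query distinguishable from every other one.
    """
    return "{}-{}".format(prefix, uuid.uuid4())
```

    The returned string can then be passed as client_context_id in AnalyticsOptions, as in the example above.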

    Priority

    By default, every analytics query has the same priority on the server. By setting this boolean flag to True, you are indicating that you need expedited dispatch in the analytics engine for this request.

    result = cluster.analytics_query(
        "select ...",
        AnalyticsOptions(priority=True)
    )

    Readonly

    If the query is marked as read-only, both the server and the SDK can improve processing of the operation. On the client side, the SDK can be more liberal with retries because it can be sure that there are no state-mutating side effects. On the server side, the engine ensures that no data is actually mutated when parsing and planning the query.

    result = cluster.analytics_query(
        "select ...",
        AnalyticsOptions(read_only=True)
    )

    Custom JSON Serializer

    As with all JSON APIs, it is possible to customize the JSON serializer by plugging in your own library. This in turn makes it possible to deserialize rows into plain data objects or other structures that your application defines and that the SDK knows nothing about.

    Please see the documentation on transcoding and serialization for more information.