Data Definition Language (DDL)

    This section lists all supported DDL statements in Couchbase Analytics.

    A user must have one of the following RBAC roles to be able to execute DDL statements: Admin, Cluster Admin, Analytics Admin, or Analytics Manager.
    No CREATE/DROP or CONNECT/DISCONNECT statement can be executed while the cluster topology is changing (for example, during rebalance). The evaluation of such DDL statements fails, and the statement can be reattempted after the topology change has completed.

    Entities

    These are the entities that are required to create shadow datasets for Data Service buckets.

    Dataverses

    A dataverse is a namespace for the other metadata entities (links, buckets, datasets, indexes). There is one Default dataverse that is available without creating it and that cannot be deleted. Because the Default dataverse is always available, creating a dataverse is optional.

    Datasets

    In Analytics, a dataset is a collection of data within a dataverse. By default, a dataset is linked to the Data service of the local Couchbase cluster. This type of dataset may be referred to as a local dataset, or simply as a dataset without qualification.

    Local datasets contain a shadow of the data of a bucket. A dataset is updated as the bucket that it shadows is updated. A dataset can contain the full content of the bucket or a filtered subset. Multiple datasets can shadow the same bucket. A dataset is created ON a bucket with an optional filter expression. The name of the dataset is optional and defaults to the name of the bucket. Creating a dataset fails if there is an existing dataset with the same name in the selected dataverse and IF NOT EXISTS is not specified.

    Remotely-Linked Datasets and External Datasets

    In Couchbase 6.6 and later, you can also create datasets on remote links and external links. These enable you to query data from outside the local Couchbase cluster.

    • A dataset on a remote link shadows the data from a remote cluster to a local dataset.

    • An external dataset reads data directly from an external source, such as Amazon S3, without shadowing it locally.

    Backup and restore only operates on datasets on the local link. Only the dataset definition is backed up or restored, not the actual data. Backup and restore excludes datasets on remote links and external links entirely.

    Before you can create an external dataset or a dataset on a remote link, you must first create the appropriate link for your dataset to use.

    In Analytics, a link represents a connection to a data source. The default type of link is a Local link.

    The Local link represents a connection to the Data service on the local Couchbase cluster, specifically to all buckets available to a specific user on the local cluster. Every dataverse implicitly contains a Local link for all buckets in the local cluster that are available to the current user.

    When a bucket has one or more shadow datasets and the Local link is connected, you cannot create a new dataset to shadow that bucket, or drop a dataset which shadows that bucket. However, you can always create a new dataset on any bucket which does not currently have a shadow dataset, even if the Local link is connected. In this case, data ingestion to the new shadow dataset does not start at once. To start ingestion to the new shadow dataset, you must disconnect the Local link and connect it again.
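    The sequence described above amounts to two statements; note that disconnecting the Local link pauses ingestion for all datasets that use it until the link is reconnected:

    ```sql
    DISCONNECT LINK Local;
    CONNECT LINK Local;
    ```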

    In Couchbase 6.6 and later, you can also create remote links and external links. These enable you to access data from outside the local Couchbase cluster.

    • A remote link is a link to a remote Couchbase cluster.

    • An external link is a link to an external data source, such as Amazon S3.

    After you have created a remote link or an external link, you must create a dataset on that link to query the data.

    To create, edit, or delete a remote or external link, you must use the command-line interface or the REST API. Refer to couchbase-cli analytics-link-setup or Analytics Links REST API.

    Indexes

    An index is a materialized access path for data in a dataset. You can create more than one index on the same dataset. Each index name must be unique within a dataset. Creating an index fails if there is an existing index with the same name in the target dataset and IF NOT EXISTS is not specified.

    For each JSON document ingested into a dataset, the system computes the indexed key for each index. The index key of a secondary index is computed as follows:

    • The target field values are extracted from the JSON document according to the specified path of fields.

    • For non-numeric typed index fields (i.e., the specified type is STRING), if the specified type of the field is the same as the actual field value’s type, the value becomes part of the indexed key; otherwise, the indexed key cannot be built.

    • For numeric typed index fields (i.e., the specified type is DOUBLE or BIGINT), the actual field value is cast to the specified type and becomes part of the indexed key. If the cast is impossible, the indexed key cannot be built.

    After the indexed key has been built, it is inserted into the secondary index. If the indexed key cannot be built, no entry is made in the index for this object.

    Secondary indexes are automatically maintained by the system during data ingestion (i.e., when their corresponding buckets are connected and data are flowing into the system). In addition, they are automatically rebalanced when their shadow datasets are rebalanced.

    Statements

    In addition to queries, N1QL for Analytics (the Couchbase Analytics implementation of SQL++) supports statements for data definition, and to connect Couchbase Analytics to Data Service buckets or external data sources.

    Statement       ::= ( ( SingleStatement )? ( ";" )+ )* <EOF>
    SingleStatement ::= CreateStatement
                      | UseStatement
                      | DropStatement
                      | ConnectStatement
                      | DisconnectStatement
                      | Query

    Create Statements

    The CREATE statement is used to create dataverses, datasets, and indexes.

    CreateStatement     ::= "CREATE" ( DataverseSpecification
                                     | DatasetSpecification
                                     | IndexSpecification )

    Dataverse Specification

    DataverseSpecification ::= "DATAVERSE" Identifier ( IfNotExists )?             ➊
    
    IfNotExists         ::= "IF" "NOT" "EXISTS"

    ➀ For dataverses, the Identifier is the dataverse name.

    Examples
    CREATE DATAVERSE `beer-data`;

    Creates a dataverse with the name beer-data to be used to manage other metadata entities.
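    If the dataverse may already exist, the optional IF NOT EXISTS clause turns the statement into a no-op instead of an error, which is convenient in setup scripts that run more than once:

    ```sql
    CREATE DATAVERSE `beer-data` IF NOT EXISTS;
    ```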

    Dataset Specification

    DatasetSpecification             ::= ( LocalDatasetSpecification
                                         | RemoteLinkedDatasetSpecification
                                         | ExternalDatasetSpecification )
    LocalDatasetSpecification        ::= "DATASET" ( IfNotExists )?
                                         ( QualifiedName )?                        ➊
                                         "ON" Identifier                           ➋
                                         ( "WITH" ObjectValue )?                   ➌
                                         ( "WHERE" Expression )?
    RemoteLinkedDatasetSpecification ::= "DATASET" ( IfNotExists )?
                                         ( QualifiedName )?
                                         "ON" Identifier
                                         ( "WITH" ObjectValue )?
                                         ( "AT" Identifier )?                      ➍
                                         ( "WHERE" Expression )?
    ExternalDatasetSpecification     ::= "EXTERNAL" "DATASET" ( IfNotExists )?
                                         ( QualifiedName )?
                                         ( TypeDefinition )?                       ➎
                                         "ON" Identifier
                                         ( "WITH" ObjectValue )?
                                         ( "AT" Identifier
                                         ( "USING" Expression )? )?                ➏
                                         ( "WHERE" Expression )?
    
    QualifiedName        ::= Identifier ( "." Identifier )?
    
    ObjectValue          ::= "{" ( Pair ( "," Pair )* )? "}"
    Pair                 ::= StringLiteral ":" ParameterValue
    ParameterValue       ::= ObjectValue | ArrayValue | Literal
    ArrayValue           ::= "[" ( ParameterValue ( "," ParameterValue )* )? "]"
    
    TypeDefinition       ::= "(" Field ( "," Field )* ")"
    Field                ::= Identifier DataType ( "NOT" "UNKNOWN" )?
    DataType             ::= ( "BOOLEAN" | "BIGINT" | "INT" | "DOUBLE" | "STRING" )

    ➀ For datasets, the QualifiedName is the full name of the dataset. It consists of an optional dataverse name and a name for the dataset. If no dataverse name is given, the dataset is created in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement. If the full name of the dataset is not specified, the bucket name is used as the dataset name.

    ➁ The Identifier in the ON clause represents the data source for the dataset. For a local dataset, this is the bucket which this dataset will shadow. For a dataset on a remote link, this is a bucket on the remote Couchbase cluster. For an external dataset, this is an external data source, such as an Amazon S3 bucket.

    ➂ Parameters for the dataset can be provided through the ObjectValue after WITH. The ObjectValue contains key-value pairs, one for each parameter. The following parameters are available. Note that many of these parameters only apply when creating an external dataset.

    storage-block-compression (object, optional)

    Determines the storage compression used for this dataset. This parameter takes a nested object value with a single field, scheme. Possible values for scheme are snappy or none. The default storage compression scheme is snappy.

    format (string, required)

    Only used when creating an external dataset. Specifies the format of the external data. May be one of the following string values:

    • json (JSON Lines)

    • csv (comma-separated values)

    • tsv (tab-separated values)

    header (boolean, required)

    Only used when creating an external dataset, and only if it has the format CSV or TSV. When true, the first row of the file is skipped.

    redact-warnings (boolean, optional; default false)

    Only used when creating an external dataset, and only if it has the format CSV or TSV. When true, sensitive information (such as the file name) is redacted from warning messages.

    null (string, optional)

    Only used when creating an external dataset, and only if it has the format CSV or TSV. The string used in the external dataset to represent a null value, for example "\\N". You may escape special characters using a backslash \. (Refer also to the NOT UNKNOWN flag in the type definition, described below.)

    include (string or array of strings, optional)

    Only used when creating an external dataset; may not be used if the exclude parameter is present. The names of the files to be included in the external dataset, for example ["*2018*.json", "*2019*.json"]. The file name may include subdirectories of the location specified by the USING clause, described below. Any files that do not match the specification are excluded. May contain the same wildcard characters as the exclude parameter.

    exclude (string or array of strings, optional)

    Only used when creating an external dataset; may not be used if the include parameter is present. The names of the files to be excluded from the external dataset, for example "*.?sv". The file name may include subdirectories of the location specified by the USING clause, described below. Any files that do not match the specification are included. May contain the following wildcard characters, in common with the include parameter:

    • * matches anything

    • ? matches any single character

    • [ sequence ] matches any characters in sequence

    • [! sequence ] matches any characters not in sequence

    ➃ The AT clause is only used when creating a dataset on a remote link, or an external dataset. The Identifier in this clause is the name of the remote or external link on which the dataset is to be created.

    ➄ The TypeDefinition is only used when creating an external dataset, and only if it has the format CSV or TSV. It consists of a comma-separated list of field definitions, which guides the transformation of each CSV or TSV record into a JSON object. Each field definition consists of:

    • A name to be assigned to the field.

    • The data type of the field. This may be any of the primitive data types, where INT is an alias for BIGINT. If this field does not contain a value of this data type, the record is ignored.

    • Optionally, the NOT UNKNOWN flag. When this flag is present, if this field is missing or null, the record is ignored.

    Every time a record is ignored because it does not match the type definition, a warning is issued.

    ➅ The USING clause is only used when creating an external dataset. The expression in this clause must resolve to a string containing a path to the location of the data, relative to the external data source.

    Examples
    CREATE DATASET beers ON `beer-sample` WHERE `type` = "beer";
    
    CREATE DATASET breweries WITH {"storage-block-compression": {"scheme": "snappy"}}
    ON `beer-sample` WHERE `type` = "brewery";

    This example creates two datasets, beers and breweries, on the beer-sample bucket, and filters the content of each dataset by the value of the type field of the record. For the breweries dataset, the snappy storage compression scheme is specified explicitly. Note that back-ticks are only necessary for reserved keywords, such as type, and for identifiers containing special characters, such as the - in beer-sample.

    CREATE DATASET airports ON `travel-sample`
    AT myCbLink WHERE `type` = "airport";

    This example creates a dataset called airports. The dataset uses data from the travel-sample bucket at the remote Couchbase cluster which is connected via the remote link called myCbLink.

    CREATE EXTERNAL DATASET reviews
    ON BeerFestival
    AT `beer-data`.myAwsLink
    USING "json-data/reviews"
    WITH {
      "format": "json",
      "include": ["*2018*.json", "*2019*.json"]
    };

    This example creates an external dataset called reviews. The dataset uses data from the Amazon S3 bucket called BeerFestival, which is connected via the link called myAwsLink in the beer-data dataverse.

    The files containing the data are located in json-data/reviews and use JSON Lines format. The only files included are those with the extension .json, and with 2018 and 2019 as part of the file name.

    CREATE EXTERNAL DATASET outgoings(
        id INT NOT UNKNOWN,
        datetime STRING NOT UNKNOWN,
        amount DOUBLE NOT UNKNOWN,
        details STRING
    ) ON BeerFestival
    AT `beer-data`.myAwsLink
    USING "csv-data/accounts"
    WITH {
      "format": "csv",
      "header": true,
      "include": "201[5-9]/*[!234].csv"
    };

    This example creates an external dataset called outgoings. The dataset uses data from the Amazon S3 bucket called BeerFestival, which is connected via the link called myAwsLink in the beer-data dataverse.

    The files containing the data are located in csv-data/accounts and have CSV format. The only files included are those with the extension .csv, in subdirectories ranging from 2015 to 2019, and whose file name does not end with 2, 3, or 4.

    Each CSV file is expected to have a header row, which is excluded. In each row, the first column is expected to be a bigint which cannot be null, the second column is expected to be a string which cannot be null, the third column is expected to be a double which cannot be null, and the fourth column is expected to be a string. If any of the rows do not meet this data model, they will not be included. These fields will be given the names id, datetime, amount, and details respectively.

    Index Specification

    IndexSpecification ::= "INDEX" ( IfNotExists )? Identifier                     ➊
                           "ON" QualifiedName                                      ➋
                           "(" FieldSpecification ( ","  FieldSpecification )* ")" ➌
    
    FieldSpecification ::= NestedField ":" ( "STRING" | "BIGINT" | "DOUBLE" )
    
    NestedField        ::= Identifier ( "." Identifier )*

    ➀ For indexes, the first Identifier is the index name.

    ➁ The QualifiedName identifies the dataset on which the index is built. It consists of an optional dataverse name and the name of the dataset. If no dataverse name is given, the dataset is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.

    ➂ The FieldSpecification consists of a NestedField that specifies a field path into the indexed JSON document, and a type identifier.

    Examples
    CREATE INDEX beers_name_idx on beers (name: string);
    CREATE INDEX breweries_name_idx on breweries (name: string);
    CREATE INDEX breweries_loc_idx on breweries (geo.lon: double, geo.lat: double);

    Creates indexes on the identified fields for the specified types.

    Use Statements

    The USE statement sets the dataverse named by the Identifier as the default dataverse for the following statement. If no USE statement is issued, the Default dataverse is used.

    UseStatement ::= "USE" Identifier

    The USE statement only takes effect in conjunction with another statement in the same request.

    Example
    USE `beer-data`;

    This example sets beer-data to be the default dataverse for the statement immediately following.
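    Because USE only applies to the statement that follows it, it is typically submitted together with that statement in a single request. For example, assuming the breweries dataset from the earlier examples exists in the beer-data dataverse:

    ```sql
    USE `beer-data`;
    SELECT VALUE bw.name FROM breweries bw LIMIT 5;
    ```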

    Drop Statements

    The DROP statement is the inverse of the CREATE statement. It can be used to drop dataverses, datasets, and indexes.

    DropStatement ::= "DROP" ( "DATAVERSE" Identifier ( IfExists )?                ➊
                             | "DATASET" QualifiedName ( IfExists )?               ➋
                             | "INDEX" DoubleQualifiedName ( IfExists )? )         ➌
    
    DoubleQualifiedName ::= Identifier "." Identifier ( "." Identifier )?
    
    IfExists      ::= "IF" "EXISTS"

    ➀ For dataverses, the Identifier is the dataverse name.

    ➁ For datasets, the QualifiedName is the full name of the dataset. It consists of an optional dataverse name and the name of the dataset. If no dataverse name is given, the dataset is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.

    ➂ For indexes, the DoubleQualifiedName is the full name of the index. It consists of an optional dataverse name, the name of the dataset, and the name of the index. If no dataverse name is given, the dataset is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.

    Examples
    DROP DATAVERSE `beer-data` IF EXISTS;

    This example drops the beer-data dataverse, if it already exists. Dropping a dataverse disconnects any links and drops any datasets contained in it.

    DROP DATASET beers;
    DROP DATASET breweries;

    This example removes the datasets and all contained data.

    DROP INDEX beers.beers_name_idx;
    DROP INDEX breweries.breweries_name_idx;
    DROP INDEX breweries.breweries_loc_idx;

    Drops the indexes.

    Connect Statements

    The CONNECT statement connects the specified link(s), so that all datasets using those links are connected to the buckets on which the datasets were created.

    ConnectStatement  ::= "CONNECT" "LINK" LinkSpecification                       ➊
                          ( "WITH" ObjectValue )?                                  ➋
    
    LinkSpecification ::= ( Identifier "." )? Identifier
                          ( "," ( Identifier "." )? Identifier )*

    ➀ For links, the LinkSpecification is the full name of the link. It consists of (optionally) the name of the dataverse to connect to, and the name of the link. For the local link, the name of the link is Local; for remote links, it is the name given to the link when it was created. If no dataverse name is given, the link is created in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.

    ➁ Parameters for the connection can be provided through the ObjectValue after WITH. The ObjectValue contains key-value pairs, one for each parameter.

    Only one parameter is currently supported for the connection: the force parameter, which takes a Boolean value (true or false) and defaults to false. This parameter determines the behavior of CONNECT LINK if there has been a change in the bucket’s UUID, i.e. the bucket has been deleted and recreated with the same name.

    If force is false, CONNECT LINK fails; this is the default behavior. If force is true, CONNECT LINK proceeds: Analytics deletes all existing data in the dataset and ingests all data from the bucket again.

    You do not need to use the CONNECT statement to connect an external dataset. An external dataset is available for query as soon as you create it.
    Examples
    CONNECT LINK Local;

    This example connects all datasets that use the link Local in the current dataverse to their Data Service buckets, and starts shadowing.

    CONNECT LINK myCbLink;

    This example connects all datasets on the remote link called myCbLink and starts shadowing.
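    If the underlying bucket has been deleted and recreated with the same name, the connection can be forced; note that this deletes the existing dataset contents and re-ingests all data from the bucket:

    ```sql
    CONNECT LINK Local WITH {"force": true};
    ```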

    Disconnect Statements

    The DISCONNECT statement is the inverse of the CONNECT statement. It disconnects the specified link(s), which stops shadowing for all datasets that use those links.

    DisconnectStatement ::= "DISCONNECT" "LINK" LinkSpecification

    The DISCONNECT statement is not applicable to external datasets.

    Examples
    DISCONNECT LINK Local;

    This example stops shadowing for all datasets that use the link Local in the current dataverse, and disconnects their Data Service buckets.

    DISCONNECT LINK myCbLink;

    This example stops shadowing and disconnects all datasets on the remote link called myCbLink.

    Metadata Introspection

    Metadata entities can be introspected by querying the datasets in the Metadata dataverse. The Metadata dataverse contains the datasets Dataverse, Dataset, Index, and Link. Each dataset contains the queryable metadata for the corresponding entity type.

    SELECT VALUE d.DataverseName || '.' || d.DatasetName
    FROM Metadata.`Dataset` d
    WHERE d.DataverseName <> "Metadata"

    Returns the qualified names of all datasets that are not in the Metadata dataverse.

    Dataverses, datasets, and indexes are created and removed by the corresponding CREATE and DROP statements, and remote links and external links are created and removed by the command-line interface or the REST API. The lifecycle of local links, however, is managed by the system: they are created and removed as needed.
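    Other metadata datasets can be queried in the same way. For example, the following query lists user-created indexes together with the datasets they belong to. The field names DataverseName, DatasetName, and IndexName are assumed here to be the fields of the Metadata.`Index` dataset; verify them against your cluster by selecting a whole record first.

    ```sql
    SELECT VALUE i.DatasetName || '.' || i.IndexName
    FROM Metadata.`Index` i
    WHERE i.DataverseName <> "Metadata";
    ```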