This section lists all supported DDL statements in Couchbase Analytics.
A user must have one of the following RBAC roles to be able to execute DDL statements: Admin, Cluster Admin, Analytics Admin, or Analytics Manager.
No CREATE/DROP or CONNECT/DISCONNECT statement can be executed while the cluster topology is changing (for example, during a rebalance). The evaluation of such a DDL statement fails, and it can be reattempted after the topology change has completed.
Entities
These are the entities that are required to create shadow datasets for Data Service buckets.
Dataverses
A dataverse is a namespace for the other metadata entities (links, buckets, datasets, indexes).
There is one Default dataverse that is available without creating it and that cannot be deleted. Because the Default dataverse is always available, creating a dataverse is optional.
Refer to Create Statements, Use Statements, and Drop Statements.
Datasets
In Analytics, a dataset is a collection of data within a dataverse. By default, a dataset is linked to the Data service of the local Couchbase cluster. This type of dataset may be referred to as a local dataset, or simply as a dataset without qualification.
Local datasets contain a shadow of the data of a bucket.
A dataset is updated as the bucket that it shadows is updated.
A dataset can contain the full content of the bucket or a filtered subset.
Multiple datasets can shadow the same bucket.
A dataset is created ON a bucket with an optional filter expression.
The name of the dataset is optional and defaults to the name of the bucket.
Creating a dataset fails if there is an existing dataset with the same name in the selected dataverse and IF NOT EXISTS is not specified.
Refer to Create Statements and Drop Statements.
Remotely-Linked Datasets and External Datasets
In Couchbase 6.6 and later, you can also create datasets on remote links and external links. These enable you to query data from outside the local Couchbase cluster.
- A dataset on a remote link shadows the data from a remote cluster to a local dataset.
- An external dataset reads data directly from an external source, such as Amazon S3, without shadowing it locally.
Backup and restore only operates on datasets on the local link. Only the dataset definition is backed up or restored, not the actual data. Backup and restore excludes datasets on remote links and external links entirely.
Before you can create an external dataset or a dataset on a remote link, you must first create the appropriate link for your dataset to use.
Refer to Remote Links and External Links, Create Statements, and Drop Statements.
Manage Datasets with the Analytics Workbench
In Couchbase Server 6.6.1 and later, you can use the Analytics Workbench to create or drop datasets on local links, remote links, and external links; and to connect or disconnect local links and remote links. Refer to Managing Datasets.
Links
In Analytics, a link represents a connection to a data source.
The default type of link is a Local link. The Local link represents a connection to the Data service on the local Couchbase cluster, specifically to all buckets available to a specific user on the local cluster. Every dataverse implicitly contains a Local link for all buckets in the local cluster that are available to the current user.
When a bucket has one or more shadow datasets and the Local link is connected,
you cannot create a new dataset to shadow that bucket, or drop a dataset which shadows that bucket.
However, you can always create a new dataset on any bucket which does not currently have a shadow dataset,
even if the Local link is connected.
In this case, data ingestion to the new shadow dataset does not start at once.
To start ingestion to the new shadow dataset, you must disconnect the Local link and connect it again.
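The reconnect sequence described above can be sketched with two statements, which are documented under the Connect and Disconnect statements later in this section:

```sql
-- Stop shadowing for all datasets on the Local link in the current dataverse...
DISCONNECT LINK Local;
-- ...then reconnect, which also starts ingestion into any newly created datasets.
CONNECT LINK Local;
```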
Refer to Connect Statements and Disconnect Statements.
Remote Links and External Links
In Couchbase 6.6 and later, you can also create remote links and external links. These enable you to access data from outside the local Couchbase cluster.
- A remote link is a link to a remote Couchbase cluster.
- An external link is a link to an external data source, such as Amazon S3.
After you have created a remote link or an external link, you must create a dataset on that link to query the data.
To create, edit, or delete a remote or external link, you must use the command-line interface or the REST API. Refer to couchbase-cli analytics-link-setup or Analytics Links REST API.
Manage Links with the Analytics Workbench
In Couchbase Server 6.6.1 and later, you can use the Analytics Workbench to create or edit remote links and external links. Refer to Managing Links.
Indexes
An index is a materialized access path for data in a dataset.
You can create more than one index on the same dataset.
Each index name must be unique within a dataset.
Creating an index fails if there is an existing index with the same name in the target dataset and IF NOT EXISTS is not specified.
For each JSON document ingested into a dataset, the system computes the indexed key for each index. The index key of a secondary index is computed as follows:
- The target field values are extracted from the JSON document according to the specified path of fields.
- For non-numeric typed index fields (i.e., the specified type is STRING), if the specified type of the field is the same as the actual field value's type, the value becomes part of the indexed key; otherwise the indexed key cannot be built.
- For numeric typed index fields (i.e., the specified type is DOUBLE or BIGINT), the actual field value is cast to the specified type and becomes part of the indexed key. If the cast is impossible, the indexed key cannot be built.
After the indexed key has been built, it is inserted into the secondary index. If the indexed key cannot be built, no entry is made in the index for this object.
Secondary indexes are automatically maintained by the system during data ingestion (i.e., when their corresponding buckets are connected and data are flowing into the system). In addition, they are automatically rebalanced when their shadow datasets are rebalanced.
Refer to Create Statements and Drop Statements.
Statements
In addition to queries, N1QL for Analytics (the Couchbase Analytics implementation of SQL++) supports statements for data definition, and for connecting Couchbase Analytics to Data Service buckets or external data sources.
Statement ::= ( ( SingleStatement )? ( ";" )+ )* <EOF> SingleStatement ::= CreateStatement | UseStatement | DropStatement | ConnectStatement | DisconnectStatement | Query
Create Statements
The CREATE statement is used to create dataverses, datasets, and indexes.
CreateStatement ::= "CREATE" ( DataverseSpecification | DatasetSpecification | IndexSpecification )
Dataverse Specification
DataverseSpecification ::= "DATAVERSE" Identifier ( IfNotExists )? ➊ IfNotExists ::= "IF" "NOT" "EXISTS"
➀ For dataverses, the Identifier is the dataverse name.
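For example, the following statement creates the beer-data dataverse used in later examples, and does nothing if a dataverse with that name already exists:

```sql
CREATE DATAVERSE `beer-data` IF NOT EXISTS;
```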
Dataset Specification
DatasetSpecification ::= ( LocalDatasetSpecification | RemoteLinkedDatasetSpecification | ExternalDatasetSpecification ) LocalDatasetSpecification ::= "DATASET" ( IfNotExists )? ( QualifiedName )? ➊ "ON" Identifier ➋ ( "WITH" ObjectValue )? ➌ ( "WHERE" Expression )? ➍ RemoteLinkedDatasetSpecification ::= "DATASET" ( IfNotExists )? ( QualifiedName )? "ON" Identifier ( "WITH" ObjectValue )? ( "AT" Identifier )? ➎ ( "WHERE" Expression )? ExternalDatasetSpecification ::= "EXTERNAL" "DATASET" ( IfNotExists )? ( QualifiedName )? ( TypeDefinition )? ➏ "ON" Identifier ( "WITH" ObjectValue )? ( "AT" Identifier ( "USING" Expression )? )? ➐ ( "WHERE" Expression )? QualifiedName ::= Identifier ( "." Identifier )? ObjectValue ::= "{" ( Pair ( "," Pair )* )? "}" Pair ::= StringLiteral ":" ParameterValue ParameterValue ::= ObjectValue | ArrayValue | Literal ArrayValue ::= "[" ( ParameterValue ( "," ParameterValue )* )? "]" TypeDefinition ::= "(" Field ( "," Field )* ")" Field ::= Identifier DataType ( "NOT" "UNKNOWN" )? DataType ::= ( "BOOLEAN" | "BIGINT" | "INT" | "DOUBLE" | "STRING" )
➀ For datasets, the QualifiedName is the full name of the dataset. It consists of an optional dataverse name and a name for the dataset. If no dataverse name is given, the dataset is created in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement. If the full name of the dataset is not specified, the bucket name is used as the dataset name.
➁ The Identifier in the ON clause represents the data source for the dataset. For a local dataset, this is the bucket which this dataset will shadow. For a dataset on a remote link, this is a bucket on the remote Couchbase cluster. For an external dataset, this is an external data source, such as an Amazon S3 bucket.
➂ Parameters for the dataset can be provided through the ObjectValue object after WITH. The ObjectValue contains key-value pairs, one for each parameter. The following parameters are available. Note that many of these parameters only apply when creating an external dataset.
Name | Description | Schema
---|---|---
storage-block-compression | Determines the storage compression used for this dataset. This parameter takes a nested object value specifying the compression scheme, for example {"scheme": "snappy"}. | object
format | The format of the external data. (Only used when creating an external dataset.) | enum (json, csv, tsv)
header | Specifies whether each file starts with a header line. (Only used when creating an external dataset, and only if it has the format CSV or TSV.) | boolean
redact-warnings | (Only used when creating an external dataset, and only if it has the format CSV or TSV.) | boolean
null | (Only used when creating an external dataset, and only if it has the format CSV or TSV.) | string
include | A pattern, or array of patterns, specifying the names of the files to include. (Only used when creating an external dataset. May not be used if the exclude parameter is present.) | string, or array of strings
exclude | A pattern, or array of patterns, specifying the names of the files to exclude. (Only used when creating an external dataset. May not be used if the include parameter is present.) | string, or array of strings
➃ The WHERE clause enables you to filter the documents in the dataset. The Expression in this clause must be deterministic.
➄ The AT clause is only used when creating a dataset on a remote link, or an external dataset. The Identifier in this clause is the name of the remote or external link on which the dataset is to be created.
➅ The TypeDefinition is only used when creating an external dataset, and only if it has the format CSV or TSV. It consists of a comma-separated list of field definitions, which guides the transformation of each CSV or TSV record into a JSON object.
Each field definition consists of:

- A name to be assigned to the field.
- The data type of the field. This may be any of the primitive data types, where INT is an alias for BIGINT. If this field does not contain a value of this data type, the record is ignored.
- Optionally, the NOT UNKNOWN flag. When this flag is present, if this field is missing or null, the record is ignored.
Every time a record is ignored because it does not match the type definition, a warning is issued.
➆ The USING clause is only used when creating an external dataset. The expression in this clause must resolve to a string containing a path to the location of the data, relative to the external data source.
Examples
CREATE DATASET beers ON `beer-sample` WHERE `type` = "beer";
CREATE DATASET breweries WITH {"storage-block-compression": {"scheme": "snappy"}} ON `beer-sample` WHERE `type` = "brewery";
This example creates two datasets, beers and breweries, on the beer-sample bucket, and filters the content for each dataset by the value of the type field of the record. For the breweries dataset, the snappy storage compression scheme is specified. Note that back-ticks are only necessary for reserved keywords, such as type, or certain operators, such as -.
CREATE DATASET airports ON `travel-sample` AT myCbLink WHERE `type` = "airport";
This example creates a dataset called airports. The dataset uses data from the travel-sample bucket at the remote Couchbase cluster which is connected via the remote link called myCbLink.
CREATE EXTERNAL DATASET reviews ON BeerFestival AT `beer-data`.myAwsLink USING "json-data/reviews" WITH { "format": "json", "include": ["*2018*.json", "*2019*.json"] };
This example creates an external dataset called reviews. The dataset uses data from the Amazon S3 bucket called BeerFestival, which is connected via the link called myAwsLink in the beer-data dataverse. The files containing the data are located in json-data/reviews and use JSON Lines format. The only files included are those with the extension .json, and with 2018 or 2019 as part of the file name.
CREATE EXTERNAL DATASET outgoings (
    id INT NOT UNKNOWN,
    datetime STRING NOT UNKNOWN,
    amount DOUBLE NOT UNKNOWN,
    details STRING
) ON BeerFestival AT `beer-data`.myAwsLink USING "csv-data/accounts" WITH { "format": "csv", "header": true, "include": "201[5-9]/*[!234].csv" };
This example creates an external dataset called outgoings. The dataset uses data from the Amazon S3 bucket called BeerFestival, which is connected via the link called myAwsLink in the beer-data dataverse. The files containing the data are located in csv-data/accounts and have CSV format. The only files included are those with the extension .csv, in subdirectories ranging from 2015 to 2019, and whose file name does not end with 2, 3, or 4. Each CSV file is expected to have a header row, which is excluded. In each row, the first column is expected to be a bigint which cannot be null, the second column is expected to be a string which cannot be null, the third column is expected to be a double which cannot be null, and the fourth column is expected to be a string. If any of the rows do not meet this data model, they are not included. These fields are given the names id, datetime, amount, and details respectively.
Index Specification
IndexSpecification ::= "INDEX" ( IfNotExists )? Identifier ➊ "ON" QualifiedName ➋ "(" FieldSpecification ( "," FieldSpecification )* ")" ➌ FieldSpecification ::= NestedField ":" "STRING"|"BIGINT"|"DOUBLE" NestedField ::= Identifier ( "." Identifier )*
➀ For indexes, the first Identifier is the index name.
➁ The QualifiedName identifies the dataset on which the index is built. It consists of an optional dataverse name and the name of the dataset. If no dataverse name is given, the dataset is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.
➂ The FieldSpecification consists of a NestedField that specifies a field path into the indexed JSON document, and a type identifier.
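As an illustrative sketch, the following statement creates the beers_name_idx index that appears in the Drop Statements examples; the name field path is an assumption about the documents in the beers dataset:

```sql
-- Index the (assumed) name field of the beers dataset as a STRING key.
CREATE INDEX beers_name_idx ON beers (name: STRING);
```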
Use Statements
The USE statement sets the Identifier to be the default dataverse for the following statement. If no USE statement is issued, the Default dataverse is used.
UseStatement ::= "USE" Identifier
The USE statement only works in conjunction with another statement in a single request.
USE `beer-data`;
This example sets beer-data to be the default dataverse for the statement immediately following.
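Because USE only takes effect within a single request, it is typically submitted together with the statement it qualifies. For example, to drop a dataset in the beer-data dataverse:

```sql
USE `beer-data`;
-- The dataset name resolves against beer-data, not Default.
DROP DATASET beers IF EXISTS;
```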
Drop Statements
The DROP statement is the inverse of the CREATE statement. It can be used to drop dataverses, datasets, and indexes.
DropStatement ::= "DROP" ( "DATAVERSE" Identifier ( IfExists )? ➊ | "DATASET" QualifiedName ( IfExists )? ➋ | "INDEX" DoubleQualifiedName ( IfExists )? ) ➌ DoubleQualifiedName ::= Identifier "." Identifier ( "." Identifier )? IfExists ::= "IF" "EXISTS"
➀ For dataverses, the Identifier is the dataverse name.
➁ For datasets, the QualifiedName is the full name of the dataset. It consists of an optional dataverse name and the name of the dataset. If no dataverse name is given, the dataset is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.
➂ For indexes, the DoubleQualifiedName is the full name of the index. It consists of an optional dataverse name, the name of the dataset, and the name of the index. If no dataverse name is given, the dataset is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.
DROP DATAVERSE `beer-data` IF EXISTS;
This example drops the beer-data dataverse, if it exists.
Dropping a dataverse disconnects any links and drops any datasets contained in it.
DROP DATASET beers;
DROP DATASET breweries;
This example removes the datasets and all contained data.
DROP INDEX beers.beers_name_idx;
DROP INDEX breweries.breweries_name_idx;
DROP INDEX breweries.breweries_loc_idx;
This example drops the indexes.
Connect Statements
The CONNECT statement connects all datasets in the given dataverse(s) to the buckets on which the datasets were created.
ConnectStatement ::= "CONNECT" "LINK" LinkSpecification ➊ ( "WITH" ObjectValue )? ➋ LinkSpecification ::= ( Identifier "." )? Identifier ( "," ( Identifier "." )? Identifier )*
➀ For links, the LinkSpecification is the full name of the link. It consists of (optionally) the name of the dataverse to connect to, and the name of the link. For the local link, the name of the link is Local; for remote links, it is the name given to the link when it was created. If no dataverse name is given, the link is assumed to be in the Default dataverse, or in the dataverse specified by the immediately preceding USE statement.
➁ Parameters for the connection can be provided through the ObjectValue object after WITH. The ObjectValue contains key-value pairs, one for each parameter. Only one parameter is currently supported for the connection: the force parameter, which takes a Boolean value (true / false), with a default of false. This parameter determines the behavior of CONNECT LINK if there has been a change in the bucket's UUID, i.e. the bucket has been deleted and recreated with the same name. If force is false, then CONNECT LINK fails; this is the default behavior. If force is true, CONNECT LINK proceeds: Analytics deletes all existing data in the dataset and ingests all data from the bucket again.
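For example, if a shadowed bucket has been deleted and recreated with the same name, the following statement discards the existing dataset contents and re-ingests the bucket's data:

```sql
-- Reconnect even though the bucket UUID has changed.
CONNECT LINK Local WITH {"force": true};
```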
You do not need to use the CONNECT statement to connect an external dataset. An external dataset is available for query as soon as you create it.
CONNECT LINK Local;
This example connects all datasets that use the link Local in the current dataverse to their Data Service buckets, and starts shadowing.
CONNECT LINK myCbLink;
This example connects all datasets on the remote link called myCbLink and starts shadowing.
Disconnect Statements
The DISCONNECT statement is the inverse of the CONNECT statement. It disconnects all datasets in the given dataverse(s).
DisconnectStatement ::= "DISCONNECT" "LINK" LinkSpecification
The DISCONNECT statement is not applicable to external datasets.
DISCONNECT LINK Local;
This example stops shadowing for all datasets that use the link Local in the current dataverse, and disconnects their Data Service buckets.
DISCONNECT LINK myCbLink;
This example stops shadowing and disconnects all datasets on the remote link called myCbLink.
Metadata Introspection
Metadata entities can be introspected by querying the datasets in the Metadata dataverse. The Metadata dataverse contains the datasets Dataverse, Dataset, Index, and Link.
Each dataset contains the queryable metadata for each entity type.
SELECT VALUE d.DataverseName || '.' || d.DatasetName FROM Metadata.`Dataset` d WHERE d.DataverseName <> "Metadata"
This example returns the qualified names of all datasets that are not in the Metadata dataverse.
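Similarly, you could list all user-created indexes by querying the Index dataset. The DatasetName and IndexName fields used here are an assumption, modeled on the DataverseName and DatasetName fields of the Dataset records above:

```sql
-- List indexes outside the Metadata dataverse, qualified by dataset name.
SELECT VALUE i.DatasetName || '.' || i.IndexName
FROM Metadata.`Index` i
WHERE i.DataverseName <> "Metadata";
```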
While dataverses, datasets, and indexes are created and removed by the corresponding CREATE and DROP statements,
and remote links and external links are created and removed by the command-line interface or the REST API,
the lifecycle of local links is managed by the system — they are created and removed as needed.