CREATE an External Collection

  • Capella Columnar
  • reference
March 16, 2025
+ 12
This topic describes how you use the CREATE statement to create a collection so that you can query OLAP data on an external data source.
 CreateRemoteCollection | CreateExternalCollection | CreateStandaloneCollection

To create a link to an external data source, you use the Capella Columnar UI. See Set Up an External Data Source.

Syntax

CreateExternalCollection EBNF
EBNF
CreateExternalCollection ::= "CREATE" "EXTERNAL" "ANALYTICS"? "COLLECTION" ("IF" "NOT" "EXISTS")? QualifiedName CollectionTypeDef? "ON" Identifier "AT" LinkName ( ( 'PATH' | 'USING' ) StringLiteral )? "WITH" ObjectConstructor
CreateExternalCollection Diagram
CREATE
You must include the keyword EXTERNAL when creating this type of collection.
Show QualifiedName Diagram
(DatabaseAndScopeName ".")? Identifier
QualifiedName
Show CollectionTypeDef Diagram
"(" CollectionFieldDef ("," CollectionFieldDef )* ")"
CollectionTypeDef
Show CollectionFieldDef Diagram
Identifier ( "BOOLEAN" | "BIGINT" | "INT" | "DOUBLE" | "STRING" ) ( "NOT" "UNKNOWN" )?
CollectionFieldDef

Examples

The following examples create external Capella Columnar collections for data stored in cloud object stores.

Use backtick characters (``) to delimit identifiers that include the - operator symbol.
SQL++
CREATE EXTERNAL COLLECTION music.myPlaylist.countrySongs ON `data-music` AT musicLink PATH "music/myPlaylist/countrySongs" WITH {"format": "json"};
Show another example
SQL++
CREATE EXTERNAL COLLECTION music.myPlaylist.rockSongs ON `data-music` AT musicLink PATH "music/myPlaylist/rockSongs" WITH {"format": "json"};

Arguments

ON

The ON clause identifies the bucket name on the external data source, such as an Amazon S3 bucket or Google Cloud Storage (GCS). Supply only the name of the bucket, not a URL.

AT

The AT clause specifies the name of the link that contains credentials for the cloud object stores. The specified link must have the type of cloud object store.

PATH

The PATH, or USING, clause is a string that specifies the location of the data files relative to the bucket name provided by the ON clause. Do not supply a filename as part of the path.

For example, the path can contain one or more cloud object stores prefixes delimited by slashes /, such as:

PATH "music/myPlaylist/rockSongs"

For your query to include files located at the bucket, or top, level of the storage organization, you supply an empty path:

PATH ""

External collections use the provided PATH to query the files in the specified location in the external object store. You can use the PATH to point to a desired subset of data only, leading to better performance as Capella Columnar does not query any files outside the provided PATH. For information about using dynamic prefixes in the path, which give you the ability to specify the exact prefix name in the WHERE clauses of your queries, see Design a Location Path and Optimize Performance of External Analytics Collections in Couchbase.

To specify particular filenames in the path to query, you can supply either an include or an exclude parameter in the WITH clause.

WITH

The WITH clause enables you to specify parameters for the collection. Its ObjectConstructor represents an object containing key-value pairs, one for each parameter. You can define the following parameters.

Name Description Schema

format
Required

Specifies the format of the external data. Accepts one of the following string values:

json — JSON Lines
csv — Comma-separated values
tsv — Tab-separated values
parquet — Apache Parquet
avro — Apache Avro

For files that have a format of CSV or TSV, you also supply a list of field names and data types in the CollectionTypeDef, described below. For information about the mapping that Capella Columnar performs for Avro and Parquet data types, see Data Type Mapping.

enum: json, csv, tsv, parquet, avro

table format
Required

An external dataset pointing to a Deltalake Table. This enables you to create external datasets based on Delta Lake tables.

"table-format": "delta"

Parquet

decimal-to-double
Optional

Delta DECIMAL values are converted to doubles, with the possibility of precision loss. The flag decimal-to-double must be set upon creating the dataset.

Boolean

timestamp-to-long
Optional

By default, Delta timestamps are converted to long. Set this flag to false to parse them as datetime.

Boolean

date-to-int
Optional

By default, Delta date are converted to int. Set this flag to false to parse them as date.

Boolean

timezone
Optional

The timezone flag adjusts temporal types (e.g., DATETIME) stored in UTC within Delta table to a specified local timezone (e.g., PST).

String

header
Required

Only used if the format is CSV or TSV.

When true, skip the first row of the file.

Boolean

redact-warnings
Optional

Only used if the format is CSV or TSV.

When true, redact sensitive information—such as the filename—from warning messages.

Default : false

Boolean

null
Optional

Only used if the format is CSV or TSV.

The string used in the external collection to represent a null value. Use a backslash \ to escape special characters: see Literals.

See the NOT UNKNOWN flag in the CollectionTypeDef definition that follows.

Example : "\\N"

string

exclude
Optional

Applies only if the include parameter is not present.

The names of the files in the specified path to exclude from querying. The supplied value can include a prefix, or subdirectory, of the location specified by the PATH clause. Capella Columnar queries any files that do not match the specification.

You can use the following wildcard characters, in common with the include parameter:

* — Matches anything
? — Matches any single character
[ sequence ] — Matches any characters in sequence
[! sequence ] — Matches any characters not in sequence
Example : "*.?sv"

string, or array of strings

include
Optional

Applies only if the exclude parameter is not present.

The names of the files in the specified path to include in queries. The supplied value can include a prefix, or subdirectory, of the location specified by the PATH clause. Capella Columnar queries only files that match the specification.

You can use the same wildcard characters as for the exclude parameter.
Example : ["*2018*.json", "*2019*.json"]

string, or array of strings

CollectionTypeDef

You use the CollectionTypeDef if the files in the external data store have a format of CSV or TSV. It consists of a comma-separated list of field definitions and their desired field types. These definitions guide the transformation of each CSV or TSV record into a JSON object. Each field definition consists of:

  • The name to assign to the field.

  • The data type of the field. This can be any of the primitive data types, where INT is an alias for BIGINT. If the field does not contain a value of this data type, Capella Columnar ignores the record and issues a warning.

  • Optionally, the NOT UNKNOWN flag. When this flag is present, if this field is missing or null, Capella Columnar ignores the record.

Data in JSON, CSV, or TSV format can be in compressed gzip files, with the extension .gz or .gzip.

For more information about external collections, see Set Up an External Data Source.

Data Type Mapping: Parquet and Avro

For files that have a format of Parquet or Avro, Capella Columnar maps data types as follows:

Parquet/Avro Data Type Capella Columnar Data Type

INT

INT

LONG

INT

FLOAT

DOUBLE

DOUBLE

DOUBLE

BYTES

BINARY

STRING

STRING

BOOLEAN

BOOLEAN

NULL

NULL

Union (nullable)*

NULL/ANY

Record

OBJECT

Array

ARRAY

MAP

ARRAY of OBJECT in the form
[{"key": <mapKey>, "value": <mapValue>}, …​]

*Avro only