CSV

This topic explains how to use the COPY TO statement to export data from a database to Amazon S3 or Google Cloud Storage (GCS) in CSV format.

COPY TO CSV introduces the ability to write results of queries or entire collections to external stores (such as AWS S3 and Google Cloud Storage) in CSV format.

For more information, see Amazon S3 and Google Cloud Storage (GCS).

To be able to read or write data to or from external cloud storage, specific permissions are required. For more information, see Cloud Read/Write Permissions.

Syntax

CopyTo EBNF
CopyTo ::= "COPY" SourceDefinition
           "TO" ContainerIdentifier
           "AT" LinkQualifiedName
           OutputClause
           WithClause

SourceDefinition ::= ( QualifiedName | "(" Query ")" ) ("AS"? AliasIdentifier)?

OutputClause ::= "PATH" "(" OutputPathExpr ("," OutputPathExpr)* ")" OverClause? SchemaClause?

OutputPathExpr ::= Expr

OverClause ::= "OVER" "(" PartitionClause? OrderClause? ")"

PartitionClause ::= "PARTITION" "BY" PartitionExpr ("," PartitionExpr)*

PartitionExpr ::= Expr ("AS" AliasIdentifier)?

OrderClause ::= "ORDER" "BY" OrderExpr ("," OrderExpr)*

OrderExpr ::= Expr ("ASC" | "DESC")? ("NULLS" ("FIRST" | "LAST"))?

SchemaClause ::= "AS" CSVTypeExpr

CSVTypeExpr ::= "(" TypeExpr ("," TypeExpr)* ")"

TypeExpr ::= FieldName FlatType ("NOT" "UNKNOWN")?

FlatType ::= "BOOLEAN" | "STRING" | "BIGINT" | "DOUBLE" | "DATE" | "TIME" | "DATETIME"

WithClause ::= "WITH" "{" NameValuePair ("," NameValuePair)* "}"

NameValuePair ::= NameStringLiteral ":" ValueLiteral

Example

  COPY (
    SELECT o.custid, o.order_date, o.orderno, o.gender, o.zip
    FROM Orders o
  ) AS t
  TO myS3Bucket
  AT myS3Link
  PATH ("commerce/Orders/zip-" || zip || "/")
  OVER (PARTITION BY t.zip AS zip)
  AS (
      custid STRING NOT UNKNOWN,
      orderno BIGINT,
      order_date STRING,
      gender STRING,
      zip STRING
  )
  WITH {
    "format": "csv",
    "header": true
  }

Arguments

SourceDefinition

As the source, you either specify the fully qualified name of a collection or provide a query.

  • If you specify a collection name, then the whole collection—or view or synonym—is the source of data to copy.

  • If you specify a query, then the result of that query is the source of data.
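For example, a sketch of copying an entire collection rather than a query result, assuming a collection named commerce.sales.Orders and a link named myS3Link (all names illustrative):

```sql
COPY commerce.sales.Orders AS o
TO myS3Bucket
AT myS3Link
PATH ("commerce/Orders/")
AS (custid STRING, orderno BIGINT)
WITH { "format": "csv" }
```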

TO

The TO clause identifies the name of the bucket on the external data source, for example an Amazon S3 or GCS bucket.

AT

The AT clause specifies the name of the link that contains credentials for the bucket. The specified link must have a type of S3 or GCS, matching the external data source.

OutputClause

The OutputClause defines the destination path for the output objects. You supply one or more OutputPathExpr expressions to identify the path prefixes. You can include aliases. If you supply more than one expression, Capella Analytics concatenates the values of all OutputPathExpr and supplies / characters as the path separators. As a result, you do not need to include slash characters between OutputPathExpr expressions.

The target directory that you specify in the destination path must be empty. The operation fails if the target directory is not empty.
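As a sketch of how path concatenation works, the two PATH clauses below produce the same destination path; in the first, Capella Analytics inserts the / separators between the expressions (path segments are illustrative):

```sql
-- Separate expressions: separators are added automatically
PATH ("commerce", "Orders", "2024")

-- Single expression with explicit separators
PATH ("commerce/Orders/2024")
```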
OverClause

You supply an optional OverClause to specify output partitioning with a PartitionClause and ordering with an OrderClause. This is similar to the OVER clause of a WINDOW statement.

  • If you specify a PartitionClause, Capella Analytics evaluates the OutputPathExpr once per logical data partition, using the aliases defined by any AS sub-clauses.

  • If you do not specify a PartitionClause, Capella Analytics evaluates the OutputPathExpr once for the whole COPY TO output dataset. That is, all of the files end up in the same directory.

You use the OrderClause to define output object order, either within each partition or for the whole dataset.
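For example, a sketch of an OverClause that partitions the output by zip code and orders the rows within each partition by order date (the alias and field names are illustrative):

```sql
OVER (
  PARTITION BY t.zip AS zip
  ORDER BY t.order_date ASC NULLS LAST
)
```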

AS (CSV Type Expression)

The AS clause defines the schema of the CSV output: a parenthesized list of field names, each followed by its type. UNKNOWN refers to the field being NULL or MISSING, so NOT UNKNOWN asserts that the field is always present and non-null.

  AS (
      custid STRING NOT UNKNOWN,
      orderno BIGINT,
      order_date DATE,
      gender STRING
  )
WITH

The WITH clause is optional. You use it to specify the following additional parameters.

max-objects-per-file (Optional)

Maximum number of objects per file.

Default: 10000

Minimum allowed value: 1000

Schema: String

compression (Optional)

Compression mechanism to apply to the output files.

Default: none

Schema: String enum (gz, gzip)

header (Optional)

Specifies whether a header line is written to each output file.

Default: false

Schema: Boolean

escape

Specifies the character used to escape the quote character and carriage returns within values.

Default: same as the quote value, so a quote character that appears in the data is doubled

Schema: String

quote

The character used to enclose string values.

Default: "

Values: NONE, ', "

Schema: String

null (Optional)

The text emitted in the CSV output for NULL or MISSING values.

Default: unquoted empty string

Schema: String

record-delimiter

The character that separates rows (records).

Schema: String

delimiter

The character that separates columns within each row (line) of the file.

Default: comma (,) for CSV format

Schema: String

empty-field-as-null

If true, empty string values are written using the null string.

Default: false

Schema: Boolean

force-quote

A flag that enables or disables quoting of non-null string values.

Default: false; special characters within a string value are always enclosed in quotes

Schema: Boolean
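Putting several of these parameters together, a WITH clause might look like the following sketch (the parameter values are illustrative):

```sql
WITH {
  "format": "csv",
  "header": true,
  "delimiter": "|",
  "null": "\\N",
  "compression": "gzip",
  "max-objects-per-file": "5000"
}
```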

Supported Data Types

The following data types are supported in a CSV schema:

  • Boolean

  • String

  • Bigint

  • Double

  • Null

  • Temporal Data (Date / Time / DateTime)
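A schema clause that exercises each supported type might look like the following sketch; the field names are illustrative, and the temporal type keywords (DATE, TIME, DATETIME) follow the list above:

```sql
AS (
  active BOOLEAN,
  name STRING,
  qty BIGINT,
  price DOUBLE,
  order_date DATE,
  order_time TIME,
  updated_at DATETIME
)
```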

Unsupported Data Types

The following data types are not supported in a CSV schema:

  • Object

  • Array

  • Multiset

Schema Mismatches and Warnings

If the actual record’s schema does not match the provided schema, a warning is issued and returned as part of the final result. The system skips writing the mismatched record and continues with the next record.

Examples of schema mismatches:

  • The provided schema has 4 fields, but the actual record has 5 fields.

  • A field’s type in the provided schema differs from that field’s type in the record.

  • The actual record has fields that are not present in the provided schema.
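As an illustration, given the provided schema AS (custid STRING, orderno BIGINT), a hypothetical record such as the following would be skipped with a warning: orderno is a string rather than a BIGINT, and amount does not appear in the schema.

```json
{ "custid": "C41", "orderno": "1002", "amount": 13.5 }
```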