Parquet
This topic explains how to use the COPY TO statement to export data from a database to Amazon S3 or Google Cloud Storage (GCS) in Parquet format.
Exporting results in Parquet format is highly efficient because of Parquet's columnar storage layout, which enables better compression and faster query performance.
Parquet files take up less storage space, making them cost-effective for large-scale data storage and analytics.
The format is optimized for read-heavy workloads, making it ideal for big data processing, data warehousing, and machine learning pipelines.
For more information, see Query Data in Amazon S3 and Query Data in Google Cloud Storage (GCS).
Note: To be able to read or write data to or from external cloud storage, exclusive permissions are required. For more information, see Cloud Read/Write Permissions.
Supported Methods
You can copy data using one of the following two methods:
- User-Defined Schema: A schema explicitly provided by the user in the COPY TO statement when the structure of the result or collection is known.
- Schema Inference: The system infers the schema from the data.
Syntax
CopyTo EBNF
CopyTo ::= "COPY" SourceDefinition
"TO" ContainerIdentifier
"AT" LinkQualifiedName
OutputClause WithClause
Examples
You can copy the Customer Data from Example Data in Parquet format with gzip compression.
Example 1: Syntax for User-Defined Schema
COPY (
  SELECT * FROM customers
) AS t
TO myS3Bucket AT myS3Link
PATH("commerce/Customers/zip=" || zip || "/")
OVER (PARTITION BY t.address.zipcode AS zip)
TYPE (
  {
    orderno: int,
    custid: string,
    order_date: date,
    ship_date: date,
    items: [ { itemno: int, qty: int, price: double } ]
  }
)
WITH {
  "format": "parquet",
  "compression": "gzip"
}
Example 2: Syntax for Schema Inference
COPY (
  SELECT * FROM customers
) AS t
TO myS3Bucket AT myS3Link
PATH("commerce/Customers/zip=" || zip || "/")
OVER (PARTITION BY t.address.zipcode AS zip)
WITH {
  "format": "parquet",
  "compression": "gzip"
}
Arguments
- SourceDefinition
  As the source, you either specify the fully qualified name of a collection or provide a query.
  - If you specify a collection name, then the whole collection (or a view or synonym) is the source of the data to copy.
  - If you specify a query, then the result of that query is the source of the data.
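For example, here is a minimal sketch of both source forms. It reuses the myS3Bucket, myS3Link, and customers names from the examples above; the fully qualified name commerce.sales.customers and the selected fields are hypothetical.

-- Source is a whole collection, referenced by a (hypothetical) fully qualified name
COPY commerce.sales.customers
TO myS3Bucket AT myS3Link
PATH("commerce/Customers/")
WITH { "format": "parquet" }

-- Source is the result of a query
COPY (
  SELECT c.custid, c.name FROM customers c
) AS c
TO myS3Bucket AT myS3Link
PATH("commerce/CustomerNames/")
WITH { "format": "parquet" }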
- TO
  The TO clause identifies the bucket name on the external data source, an Amazon S3 bucket in this case.
- AT
  The AT clause specifies the name of the link that contains the credentials for the S3 bucket. The specified link must have a type of S3.
- OutputClause
  The OutputClause defines the destination path for the output objects. You supply one or more OutputPathExpr expressions to identify the path prefixes. You can include aliases. If you supply more than one expression, Capella Analytics concatenates the values of all OutputPathExpr expressions and supplies / characters as the path separators. As a result, you do not need to include slash characters between OutputPathExpr expressions.
  Note: The target directory that you specify in the destination path must be empty. The operation fails if the target directory is not empty.
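For example, the following sketch assumes that multiple OutputPathExpr expressions can be supplied as comma-separated arguments to PATH, as the concatenation behavior described above suggests; the bucket, link, and collection names reuse the examples above.

COPY (
  SELECT * FROM customers
) AS t
TO myS3Bucket AT myS3Link
-- Two path expressions; Capella Analytics joins them with "/" to form the prefix commerce/Customers/
PATH("commerce", "Customers")
WITH { "format": "parquet" }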
- OverClause
  You supply an optional OverClause to specify output partitioning with a PartitionClause and ordering with an OrderClause. This is similar to the OVER clause of a window function.
  - If you specify a PartitionClause, Capella Analytics evaluates the OutputPathExpr once per logical data partition, and the expression can refer to aliases defined by an AS sub-clause.
  - If you do not specify a PartitionClause, Capella Analytics evaluates the OutputPathExpr once for the whole COPY TO output dataset. That is, all of the files end up in the same directory.
  You use the OrderClause to define the output object order, either within each partition or for the whole dataset.
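For example, here is a sketch that combines a PartitionClause with an OrderClause; the address.country and custid fields are assumed purely for illustration.

COPY (
  SELECT * FROM customers
) AS t
TO myS3Bucket AT myS3Link
-- One output directory per country value, reusing the partition alias in the path
PATH("commerce/Customers/country=" || country || "/")
OVER (
  PARTITION BY t.address.country AS country
  -- Controls the order of the objects written within each partition
  ORDER BY t.custid
)
WITH { "format": "parquet" }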
- SchemaClause
  The SchemaClause defines the schema for the output Parquet files. You specify the schema using a JSON-like format: { field-name1: type1, field-name2: type2, … }. The types can be flat types, array types, or object types. Supported flat types are listed in the Supported Data Types section.
- WITH
  You use the WITH clause to specify the following additional parameters. A combined sketch follows the table.
| Name | Description | Schema |
|---|---|---|
| format (optional) | Allowed values: "parquet". Default: JSON. | String enum |
| max-objects-per-file | Maximum number of objects per file. Default: 10000. | int |
| max-schemas (optional) | Maximum number of heterogeneous schemas allowed. This value cannot be greater than 10 and is valid only for schemaless mode (TYPE not provided). Default: 5. | int |
| compression (optional) | Allowed values: "none", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd". Default: none. | String enum |
| row-group-size (optional) | Row group size in the Parquet file, in bytes. Default: 10 MB. | String |
| page-size (optional) | Page size in the Parquet file, in bytes. Default: 8 KB. | String |
| version (optional) | Parquet writer version. Allowed values: 1, 2. Default: 1. | String enum |
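Here is a combined sketch of these parameters (schema inference is used, so max-schemas applies). The exact value forms accepted for the byte sizes and the writer version are assumptions based on the defaults listed in the table.

COPY (
  SELECT * FROM customers
) AS t
TO myS3Bucket AT myS3Link
PATH("commerce/Customers/")
WITH {
  "format": "parquet",
  "compression": "zstd",
  "max-objects-per-file": 20000,
  "max-schemas": 8,
  "row-group-size": "16MB",
  "page-size": "64KB",
  "version": "1"
}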
Supported Data Types
The supported Analytics types and the corresponding Parquet types are listed in the following table:
| Analytics Type | Parquet Type |
|---|---|
| boolean | BOOLEAN |
| string | BINARY(STRING) |
| tinyint, smallint, int | INT32 |
| bigint | INT64 |
| float | FLOAT |
| double | DOUBLE |
| date | INT32(DATE) |
| time | INT32(TIME_MILLIS) |
| datetime | INT64(TIMESTAMP_MILLIS) |
| UUID | FIXED_LEN_BYTE_ARRAY(UUID) |
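For example, here is a hypothetical TYPE clause that exercises several of these types; the orders collection and its field names are illustrative and not part of the Example Data.

-- boolean, string, int, double, date, and datetime fields map to the Parquet types listed above
COPY (
  SELECT * FROM orders
) AS o
TO myS3Bucket AT myS3Link
PATH("commerce/Orders/")
TYPE (
  {
    orderno: int,
    custid: string,
    paid: boolean,
    total: double,
    order_date: date,
    created_at: datetime,
    items: [ { itemno: int, qty: int, price: double } ]
  }
)
WITH { "format": "parquet" }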