Parquet

This topic explains how to use the COPY TO statement to export data from a database to Amazon S3 or Google Cloud Storage (GCS) in Parquet format.

Exporting results in Parquet format is highly efficient due to Parquet's columnar storage, which enables better compression and faster query performance.

Parquet files take less storage space, making them cost-effective for large-scale data storage and analytics.

The format is optimized for read-heavy workloads, making it ideal for big data processing, data warehousing, and machine learning pipelines.

For more information, see Amazon S3 and Google Cloud Storage (GCS).

To read or write data from or to external cloud storage, the appropriate permissions are required. For more information, see Cloud Read/Write Permissions.

Supported Methods

You can copy data using one of two methods:

  1. User-Defined Schema: You explicitly provide the schema in the COPY TO statement when the structure of the result or collection is known.

  2. Schema Inference: The system infers the schema from the data.

Syntax

CopyTo ::= "COPY" SourceDefinition
           "TO" ContainerIdentifier
           "AT" LinkQualifiedName
           OutputClause WithClause

SourceDefinition ::= ( QualifiedName | "(" Query ")" ) ("AS"? AliasIdentifier)?

OutputClause ::= "PATH" "(" OutputPathExpr ("," OutputPathExpr)* ")" OverClause? SchemaClause?

OutputPathExpr ::= Expr

OverClause ::= "OVER" "(" PartitionClause? OrderClause? ")"
SchemaClause ::= "TYPE" "(" TypeExpression ")"

TypeExpression ::= FlatType | ObjectTypeDefinition | ArrayTypeDefinition

ObjectTypeDefinition ::= "{" ObjectField ("," ObjectField)* "}"

ArrayTypeDefinition ::= "[" TypeExpression "]"

ObjectField ::= Identifier ":" TypeExpression
PartitionClause ::= "PARTITION" "BY" PartitionExpr ("," PartitionExpr)*

PartitionExpr ::= Expression ("AS" AliasIdentifier)?

OrderClause ::= "ORDER" "BY" OrderExpr ("," OrderExpr)*

OrderExpr ::= Expression ("ASC" | "DESC")? ( "NULLS" ( "FIRST" | "LAST" ) )?

WithClause ::= "WITH" "{" NameValuePair ("," NameValuePair)* "}"

NameValuePair ::= NameStringLiteral ":" ValueLiteral

Examples

The following examples copy the customer data from Example Data in Parquet format with gzip compression.

Example 1: Syntax for User-Defined Schema

  COPY (
    SELECT * FROM customers
  ) AS t
  TO myS3Bucket AT myS3Link
    PATH("commerce/Customers/zip=" || zip || "/")
    OVER (PARTITION BY t.address.zipcode AS zip)
  TYPE (
    {
      orderno: int,
      custid: string,
      order_date: date,
      ship_date: date,
      items: [ { itemno: int, qty: int, price: double } ]
    }
  )
  WITH {
    "format": "parquet",
    "compression": "gzip"
  }

Example 2: Syntax for Schema Inference

  COPY (
    SELECT * FROM customers
  ) AS t
  TO myS3Bucket AT myS3Link
    PATH("commerce/Customers/zip=" || zip || "/")
    OVER (PARTITION BY t.address.zipcode AS zip)
  WITH {
    "format": "parquet",
    "compression": "gzip"
  }

Arguments

SourceDefinition

As the source, you specify either the fully qualified name of a collection or a query.

  • If you specify a collection name, the entire collection, view, or synonym is the source of the data to copy.

  • If you specify a query, then the result of that query is the source of data.

TO

The TO clause identifies the bucket name on the external data source, such as an Amazon S3 or GCS bucket.

AT

The AT clause specifies the name of the link that contains the credentials for the target bucket. The type of the specified link must match the storage provider, for example S3.

OutputClause

The OutputClause defines the destination path for the output objects. You supply one or more OutputPathExpr expressions to identify the path prefixes, and the expressions can include aliases. If you supply more than one expression, Capella Analytics concatenates the values of all the OutputPathExpr expressions, inserting / characters as path separators. As a result, you do not need to include slash characters between OutputPathExpr expressions.

The target directory that you specify in the destination path must be empty. The operation fails if the target directory is not empty.
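
To illustrate path concatenation, the following sketch (the orders collection and the sales/2025 prefix are hypothetical) writes all output under sales/2025/ without any explicit slash characters between the path expressions:

  COPY orders AS o
  TO myS3Bucket AT myS3Link
    PATH("sales", "2025")
  WITH {
    "format": "parquet"
  }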
OverClause

You supply an optional OverClause to specify output partitioning with a PartitionClause and ordering with an OrderClause. This is similar to the OVER clause of a window function.

  • If you specify a PartitionClause, Capella Analytics evaluates the OutputPathExpr once per logical data partition, and the path expressions can refer to aliases defined by an AS sub-clause.

  • If you do not specify a PartitionClause, Capella Analytics evaluates the OutputPathExpr once for the whole COPY TO output dataset. That is, all of the files end up in the same directory.

You use the OrderClause to define output object order, either within each partition or for the whole dataset.
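
Building on the customers example above, the following sketch (the ordering field custid is illustrative) sorts the output objects within each zip partition:

  COPY (
    SELECT * FROM customers
  ) AS t
  TO myS3Bucket AT myS3Link
    PATH("commerce/Customers/zip=" || zip || "/")
    OVER (
      PARTITION BY t.address.zipcode AS zip
      ORDER BY t.custid ASC NULLS LAST
    )
  WITH {
    "format": "parquet"
  }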

SchemaClause

The SchemaClause defines the schema for the output Parquet files. You specify the schema using a JSON-like format: { field-name1: type1, field-name2: type2, … }. The types can be flat types, array types, or object types. Supported flat types are listed in the Supported Data Types section.

WITH

You use the WITH clause to specify the following additional parameters.

format (optional)

Allowed values: "parquet"

Default: "json"

Schema: string enum

max-objects-per-file

Maximum number of objects per file.

Default: 10000

Schema: int

max-schemas (optional)

Maximum number of heterogeneous schemas allowed. This value cannot be greater than 10, and it applies only in schema inference mode (when TYPE is not provided).

Default: 5

Schema: int

compression (optional)

Allowed values: "none", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd"

Default: "none"

Schema: string enum

row-group-size (optional)

Row group size of the output Parquet files, in bytes.

Default: 10 MB

Schema: string

page-size (optional)

Page size of the output Parquet files, in bytes.

Default: 8 KB

Schema: string

version (optional)

Parquet writer version.

Allowed values: "1", "2"

Default: "1"

Schema: string enum
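
As a sketch combining several of these parameters (the values shown are arbitrary, not recommendations, and the byte-size strings follow the default formats shown above):

  COPY customers AS c
  TO myS3Bucket AT myS3Link
    PATH("commerce/Customers")
  WITH {
    "format": "parquet",
    "compression": "zstd",
    "max-objects-per-file": 5000,
    "row-group-size": "10 MB",
    "page-size": "8 KB",
    "version": "2"
  }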

Supported Data Types

The supported Analytics types and the corresponding Parquet types are listed in the following table:

  Analytics Type           Parquet Type
  boolean                  BOOLEAN
  string                   BINARY(STRING)
  tinyint, smallint, int   INT32
  bigint                   INT64
  float                    FLOAT
  double                   DOUBLE
  date                     INT32(DATE)
  time                     INT32(TIME_MILLIS)
  datetime                 INT64(TIMESTAMP_MILLIS)
  UUID                     FIXED_LEN_BYTE_ARRAY(UUID)
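
As a final sketch, a user-defined schema can combine several of these flat types (the orders collection and field names are hypothetical):

  COPY orders AS o
  TO myS3Bucket AT myS3Link
    PATH("commerce/Orders")
  TYPE (
    {
      orderno: bigint,
      order_date: date,
      created_at: datetime,
      total: double
    }
  )
  WITH {
    "format": "parquet"
  }

Here, orderno maps to INT64, order_date to INT32(DATE), created_at to INT64(TIMESTAMP_MILLIS), and total to DOUBLE in the resulting Parquet files.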