Design a Location Path
- Capella Columnar
- concept
To make querying an external data source more efficient, you supply a location path that consists of prefixes.
By supplying the prefixes, you reduce the number of files that Capella Columnar needs to read when you query an external data source.
About Static and Dynamic Prefixes
When you create an external collection in Capella Columnar, the location path that you provide acts as a filter on the target bucket. The more precise you can be when you define the path, the better your query performance. How you define the path can also affect hosting service costs by decreasing the number of requests to scan files and transfer results.
When you define a location path, you can include:
-
Static prefixes, which are the exact names of the prefixes that define a location in your S3 bucket. Using static prefixes results in a full scan of every file found under that prefix or set of prefixes, every time you query the collection.
-
Dynamic prefixes, which are placeholders or tokens for the actual path prefixes. You supply the values for these tokens in the
WHERE
clause of your queries. Using dynamic prefixes gives you more flexibility in how you set up your collections, helping you reduce the number of collections that you need to maintain.
A path can include either static or dynamic prefixes or both.
For more information about how you create an external collection, see Set Up an External Data Source or CREATE an External Collection.
For information about using prefixes in your S3 buckets, see Organizing objects using prefixes in the AWS documentation.
Examples of Paths and Queries for External Collections
A cloud storage bucket myBucket
contains files in the following locations:
reviews/2021/jan/reviews.json
reviews/2021/feb/reviews.json
reviews/2021/mar/reviews.json
...
reviews/2022/nov/reviews.json
reviews/2022/dec/reviews.json
There are 24 files named reviews.json, one for each month in 2021 and 2022. These locations, called prefixes (or folders if you use the Amazon S3 console), make the date range covered by a given file easier to find.
Each reviews.json file contains the following data:
{
"year" : 2022,
"month" : "dec",
"id" : int,
"date" : "dd/MM/yyyy",
"propertyName" : "value",
"comment" : "value",
"guestName" : "value"
},
Example: Static Prefixes
To query these files as efficiently as possible, you want your collections to include the full path.
Using static prefixes, you create an external collection for each month:
collection2021jan
with the path reviews/2021/jan
collection2021feb
with the path reviews/2021/feb
and so on, until you set up the 24th collection:
collection2022dec
with the path reviews/2022/dec
When you query any one of these collections, the collection supplies a precise location, and Capella Columnar reads only files found under that complete path.
You may find that querying 24 different collections using these different names is manageable. Over time, however, that number is likely to grow as you add a new collection for each month in 2023, 2024, and so on.
In addition, consider a path structure that includes prefixes like mobile/customer/1
through mobile/customer/99999999
, representing data for millions of different customers.
Setting up individual collections using only static prefixes would result in millions of collections, which would be difficult to manage.
Example: Static Prefixes and WHERE Clauses
Returning to the reviews.json
files, you decide to make collection management easier by creating just one collection for the reviews
path:
reviewsCollection
with the path reviews
When you query this collection, you supply a WHERE
clause to retrieve a specific set of results.
SELECT *
FROM reviews
WHERE year = 2021 and month = "jan";
This query returns exactly the results you were looking for, the reviews from January 2021.
To do so, Capella Columnar uses the path defined in the external collection to determine which files to read in the bucket.
As a result, this query scans all of the files under the reviews/
path.
The WHERE year = 2021 and month = "jan"
clause is applied only afterwards, to the data in the files found in reviews
, to filter the results after all files have been read.
While this approach successfully limits the number of collections that you need to maintain, it does not provide any efficiencies for querying.
If you used this approach for the mobile/customer/1
through mobile/customer/99999999
example, every query would read files for millions of customers.
Example: Dynamic Prefixes and WHERE Clauses
Another approach for your reviews.json
files is to set up a collection that uses dynamic prefixes instead of static prefixes.
Instead of 24 collections with static paths like reviews/2022/dec
, you set up a single collection:
reviewsCollection
with the path reviews/{year:int}/{month:string}
Dynamic prefixes use the format {TOKEN_NAME:DATA_TYPE}
, where:
-
TOKEN_NAME
identifies a placeholder. You can supply the name of a data field in the target file, or any other name you find memorable or useful. For information about how the choice of name can affect query results, see Example:"embed-filter-values": "true"
Adds Only. -
DATA_TYPE
identifies the value found in that position of the path. Acceptsint
,double
, orstring
.
When your queries include a WHERE
clause with values specified for the year
and month
, Capella Columnar uses them to replace the path tokens before reading the bucket:
-- this query reads every file under reviews, 24 files:
SELECT *
FROM reviews;
-- this query reads every file under the reviews/2022 path, 12 files:
SELECT *
FROM reviews
WHERE year = 2022;
-- this query reads every file under the reviews/2022/dec path, 1 file only:
SELECT *
FROM reviews
WHERE year = 2022 AND month = "dec";
By using dynamic prefixes in the collection’s path, the WHERE
clause of each query acts first as a filter on the files to read, and then again as a filter on the results returned from each file.
To use a dynamic prefix for the data in the mobile/customer/1
through mobile/customer/99999999
example, you could set up an external collection
customersCollection
with the path mobile/customer/{customerId:int}
.
and write queries that replace the customerId
token to filter what needs to be read:
SELECT *
FROM customer
WHERE customerId = SOME_INTEGER;
The WHERE
clause can use the following operators: =
, !=
, <
, >
, BETWEEN
, IN
.
Adding the Filter Value to Results
When you create an external collection that uses dynamic prefixes, you can include an embed-filter-values
parameter.
CREATE EXTERNAL COLLECTION reviewsCollection
ON `myBucket`
USING "reviews/{year:int}/{month:string}"
AT myLink
WITH {"format": "json", "embed-filter-values": "true"};
The embed-filter-values
parameter defines whether Capella Columnar modifies the objects returned in the query results.
-
"embed-filter-values": "false"
does not modify the objects-
If a data field with the same name as the
TOKEN_NAME
is not present, do not include the object in the results. -
If a data field with the same name is present, leave its value unchanged.
-
-
"embed-filter-values": "true"
can modify the objects-
If a data field with the same name as the
TOKEN_NAME
is not present, append it to each object as a new data field. As its value, supply the value found in the path position that the dynamic prefix represents. -
If a data field with the same name is present, overwrite its value with the value found in the path position that the dynamic prefix represents.
-
The embed-filter-values
parameter defaults to true.
Examples of Modified and Unchanged Results
The objects in the reviews.json
documents have the following data fields:
{
"year" : 2022,
"month" : "dec",
"id" : int,
"date" : "dd/MM/yyyy",
"propertyName" : "value",
"comment" : "value",
"guestName" : "value"
},
One of the objects is incomplete and does not include a month
value.
Another was misfiled: it includes a month, but it is not "dec".
{"id" : 1, "year" : 2022, "propertyName" : "El San Juan", "comment" : "value"},
{"id" : 2, "year" : 2022, "month" : "dec", "day": "7", "propertyName" : "San-Severin", "comment" : "value"},
{"id" : 3, "year" : 2022, "month" : "jul", "day": "14", "propertyName" : "The Mark", "comment" : "value"}
Example: "embed-filter-values": "false"
With "embed-filter-values": "false"
defined for the external collection:
SELECT *
FROM reviewsCollection
WHERE year = 2022 AND month = "dec";
The query returns only one object, the one with month = "dec"
, in the results.
{"id" : 2, "year" : 2022, "month" : "dec", "day": "7", "propertyName" : "San-Severin", "comment" : "value"},
The query filters out the two objects that do not have month = "dec"
.
Example: "embed-filter-values": "true"
Adds and Overwrites
With "embed-filter-values": "true"
defined for the external collection:
SELECT * FROM reviewsCollection WHERE year = 2022 AND month = "dec";
This query returns all three objects in the results:
{"id" : 1, "year" : 2022, "month" : "dec", "propertyName" : "El San Juan", "comment" : "value"}, {"id" : 2, "year" : 2022, "month" : "dec", "day": "7", "propertyName" : "San-Severin", "comment" : "value"}, {"id" : 3, "year" : 2022, "month" : "dec", "day": "14", "propertyName" : "The Mark", "comment" : "value"}
In these results, all of the objects now include "month" : "dec"
:
-
The object that did not have a value for the
"month"
now includes"month" : "dec"
. -
The object that had a different value for the
"month"
now has"month" : "dec"
.
Example: "embed-filter-values": "true"
Adds Only
An option to consider is that your dynamic prefixes can have any name. They do not have to match the names of data fields in your files.
You set up your collection with a dynamic prefix token of "mmm"
instead of "month"
:
CREATE EXTERNAL COLLECTION reviewsCollection ON myBucket USING "reviews/{year:int}/{mmm:string}" AT myLink WITH {"format": "json", "embed-filter-values": "true"};
A query with a WHERE year = 2022 AND mmm = "dec"
clause returns all three objects in the results:
{"id" : 1, "year" : 2022, "mmm" : "dec", "propertyName" : "El San Juan", "comment" : "value"}, {"id" : 2, "year" : 2022, "mmm" : "dec", "month" : "dec", "day": "7", "propertyName" : "San-Severin", "comment" : "value"}, {"id" : 3, "year" : 2022, "mmm" : "dec", "month" : "jul", "day": "14", "propertyName" : "The Mark", "comment" : "value"}
In these results all of the objects now include "mmm" : "dec"
.
-
The object that did not include a
"month"
still does not include the"month"
. -
The object that had a different value for the
"month"
still has that same value.
For additional examples, see Optimize Performance of External Analytics Collections in Couchbase.