Set Up Google Cloud Storage (GCS) External Source

    To provide query access to OLAP data in GCS, you create an external link and associate it with an external collection.

    Prerequisites

    Your Capella Analytics account must have either the Project Owner or Project Manager role to create a link for the external data.

    • If you want to access private data from a GCS bucket, you need credentials that can list and read data from that bucket. For more information, see Credentials.

    • You have the path to the data you want to access from your GCS bucket. For more information, see Location Path.

    Create a Link for GCS

    To create an external link to a GCS bucket:

    1. In the Capella UI, select the Capella Analytics tab.

    2. Click a cluster name.

    3. Use the explorer to explore the existing databases, scopes, and collections. You can add a database and scope if necessary: see Create a Database.

    4. Select Create → Data Link.

    5. Select Google Cloud Storage, then click Continue.

    6. In the Link Name field, enter a name for the link.

    7. (Optional) If the GCS bucket is private, add your JSON credentials to the Authentication field.

    8. Click Save & Continue to proceed. Capella Analytics creates the link to the GCS data source.
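Before pasting credentials into the Authentication field in step 7, it can help to confirm that the text is well-formed JSON. The following is a minimal sketch, not part of Capella Analytics: the function name `check_credentials_json` is hypothetical, and the check covers syntax only. It cannot tell whether the credentials are valid or allowed to list and read the bucket.

```python
import json


def check_credentials_json(text: str) -> dict:
    """Parse pasted credentials text and confirm it is a JSON object.

    Hypothetical helper: it checks syntax only and says nothing about
    whether the credentials grant access to the bucket.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"Credentials are not valid JSON: {err}") from err
    if not isinstance(parsed, dict):
        raise ValueError("Credentials must be a JSON object, not a list or scalar")
    return parsed


if __name__ == "__main__":
    # Placeholder content for illustration only; not a real key file.
    sample = '{"example_field": "example_value"}'
    print(sorted(check_credentials_json(sample)))
```

A truncated copy and paste, such as a missing closing brace, raises `ValueError` here rather than surfacing later as a failed query.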

    Create a Collection for GCS Data

    You must create a collection for the data from your GCS bucket before you can query it in Capella Analytics. After you create the link to GCS, Capella Analytics prompts you to create a collection for your data. To create the collection immediately, click Create Linked Collection. To create the collection later, click Complete Later. When you're ready to create the collection, hover over the link name under Links and select More Options (⋮) → Create Linked Collection.

    To complete creating the collection:

    1. On the Create Collection Linked to <GCSLinkName> dialog, select the database and scope and enter a name for the collection.

    2. In the GCS Bucket field, enter the name of a GCS bucket. Enter only the name of the bucket, not a URL.

    3. In the GCS Path field, enter one or more prefixes separated by slashes (/) to identify the location of the files you want to query. Do not include filenames in the path. To query files located at the top-most or bucket level, leave the path blank. See Design a Location Path.

    4. Choose the File Format of the files at that destination. Depending on the format you select, you may see additional fields:

      • CSV and TSV

        • Define the data types for the fields in the files as a comma-separated list of <field-name> <datatype> values. The <datatype> is one of the primitive data types. If a field's value does not match the data type, Capella Analytics ignores the record. You can also specify the NOT UNKNOWN flag after the data type to have Capella Analytics ignore the record if the value is missing or null. For example:

          id BIGINT NOT UNKNOWN, firstname STRING, lastname STRING

        • Clear File includes header row if the first line of your CSV or TSV file is not a list of the columns in the file.

        • If your data uses a value other than an empty string ("") to indicate a null value, select Use custom string as Null and enter the value.

      • Parquet

        • Choose whether Capella Analytics should parse embedded JSON data and convert decimal values to doubles.

    5. (Optional) Use either the Include or Exclude field to specify files to include in, or exclude from, queries. You can use the following wildcards:

      • * matches any character or characters.

      • ? matches any single character.

      • [ sequence ] matches any single character in the supplied sequence.

      • [! sequence ] matches any single character not in the supplied sequence.

        For example, if the bucket stores both JSON and Parquet files, you can enter *.JSON in the Include field to query only the files that are in JSON format.

    6. Click Create Collection. Your link and collection appear under the scope in the explorer.
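Steps 2 and 3 ask for the bucket name and the path as separate values, with no URL and no filename. If you start from a `gs://` style URL, the split can be sketched as follows. The function name `split_gcs_location` and its `last_segment_is_file` parameter are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def split_gcs_location(url: str, last_segment_is_file: bool = False) -> tuple[str, str]:
    """Split a gs:// URL into (bucket name, path prefix).

    The bucket name goes in the GCS Bucket field and the prefix in the
    GCS Path field. An empty prefix means the top (bucket) level.
    """
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs://bucket/... URL, got: {url!r}")
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [s for s in parsed.path.split("/") if s]
    # The path must not include a filename, so drop it when the caller says so.
    if last_segment_is_file and segments:
        segments = segments[:-1]
    return parsed.netloc, "/".join(segments)


if __name__ == "__main__":
    bucket, path = split_gcs_location(
        "gs://my-bucket/sales/2024/orders.json", last_segment_is_file=True
    )
    print(bucket)  # my-bucket
    print(path)    # sales/2024
```

The caller states whether the last segment is a filename because a prefix such as `sales/2024` cannot be told apart from a file by its name alone.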

    The link now supplies your credentials whenever you query data in the external data source.
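For CSV and TSV files, step 4 expects the field definitions as a comma-separated list of `<field-name> <datatype>` values, optionally followed by NOT UNKNOWN. When a file has many columns, generating that string can be less error-prone than typing it. This is a minimal sketch; `build_field_definitions` is a hypothetical helper, and it does not check that each data type is one of the supported primitive types.

```python
def build_field_definitions(fields):
    """Build the comma-separated field definition string.

    fields: iterable of (name, datatype, required) tuples. When required
    is True, NOT UNKNOWN is appended so that records with a missing or
    null value in that field are ignored.
    """
    parts = []
    for name, datatype, required in fields:
        part = f"{name} {datatype.upper()}"
        if required:
            part += " NOT UNKNOWN"
        parts.append(part)
    return ", ".join(parts)


if __name__ == "__main__":
    definition = build_field_definitions([
        ("id", "bigint", True),
        ("firstname", "string", False),
        ("lastname", "string", False),
    ])
    print(definition)
    # id BIGINT NOT UNKNOWN, firstname STRING, lastname STRING
```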

    Because the data in an external collection is not ingested into Capella Analytics and remains on the external host, Capella Analytics cannot index it.
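The wildcards listed in step 5 use the same symbols as Python's `fnmatch` module, so you can preview which filenames an Include or Exclude pattern is likely to select before you create the collection. This is an approximation under stated assumptions: Capella Analytics may treat case or path separators differently, and `select_files` is a hypothetical helper, not a Capella API.

```python
from fnmatch import fnmatchcase


def select_files(names, include=None, exclude=None):
    """Return the names kept by an Include or an Exclude pattern.

    Mirrors the either/or choice in step 5: pass include or exclude,
    not both. Matching is case-sensitive (fnmatchcase).
    """
    if include is not None and exclude is not None:
        raise ValueError("Use either include or exclude, not both")
    if include is not None:
        return [n for n in names if fnmatchcase(n, include)]
    if exclude is not None:
        return [n for n in names if not fnmatchcase(n, exclude)]
    return list(names)


if __name__ == "__main__":
    files = ["orders.JSON", "orders.parquet", "log1.txt", "logA.txt"]
    print(select_files(files, include="*.JSON"))         # ['orders.JSON']
    print(select_files(files, include="log[0-9].txt"))   # ['log1.txt']
    print(select_files(files, include="log[!0-9].txt"))  # ['logA.txt']
    print(select_files(files, exclude="*.parquet"))
```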