Vectorize Structured Data from Capella


Use a Data from Capella Workflow to automatically generate embedding vectors from JSON data in your Capella operational cluster. Use embedding vectors for similarity searches on your data.

The Vectorization Service automatically creates a Vector Search index for your embeddings, so you can get started with Vector Search right away. You can use Vector Search to support Retrieval Augmented Generation (RAG) in your applications, or for other vector similarity use cases.

Your data must already be extracted, filtered, and chunked in preparation for generating embeddings.
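
If your text is not yet chunked, the following is a minimal sketch of one way to split a long text field into overlapping, word-based chunks before you load it into your cluster. It is not part of the Vectorization Service. The function names, the chunk sizes, and the output document shape (`id`, `source_id`, `chunk_index`, `text`) are all assumptions made for illustration.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def to_chunk_documents(doc_id: str, text: str, **chunk_options) -> list[dict]:
    """Wrap each chunk in a JSON-style document (hypothetical shape)."""
    return [
        {
            "id": f"{doc_id}::chunk::{index}",
            "source_id": doc_id,
            "chunk_index": index,
            "text": chunk,
        }
        for index, chunk in enumerate(chunk_words(text, **chunk_options))
    ]
```

Each dictionary returned by `to_chunk_documents` could then be stored as its own JSON document, with the `text` field selected as a source field in the workflow. The right chunk size depends on your embedding model and your data.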

To generate your embeddings, you can use a model hosted by the Capella Model Service or OpenAI. Capella stores the generated vector embeddings and Vector Search index in an operational cluster.

Prerequisites

You have data available in JSON format inside a Capella operational cluster. If your data is not yet in JSON format, see Process and Vectorize Unstructured Data.
If you want to use a model hosted on Capella, you must have:
- Deployed a Capella embedding model. For more information, see Deploy an Embedding Model.
- Your model’s API Key ID and API Key Token. For more information about API keys for Capella models, see Get Started with AI Services APIs.
If you want to use a model hosted by OpenAI, you must have your OpenAI API Key. For more information about how to find your OpenAI API Key, see the OpenAI Help Center.
You have deployed an operational cluster that has the following:
- Couchbase Server version 8.0 or later.
- The Search Service and Eventing Service running on at least 1 Service Group. For more information, see Services and Service Groups.
- A bucket that can store the Vector Search index and any generated vector embeddings.
  
  Use whichever bucket settings suit your use case. For more information, see Manage Buckets.

Procedure

To create a new Data from Capella workflow and process your JSON data from a Capella operational cluster:

Go to AI Services Workflows.
Click Create New Workflow.
Click Data from Capella.
In the Workflow Name field, enter a name to identify your Data from Capella Workflow, or accept the automatically generated name.

Workflow names can be a maximum of 128 characters and can include letters (A-Z, a-z), numbers (0-9), dashes (-), and underscores (_).
Click Start Workflow.
Under Data Source, in the Cluster list, select the operational cluster where your data is stored. Your cluster must meet the criteria in the Prerequisites to appear in the list.
Use the Bucket, Scope, and Collection lists to select where your data is stored on your operational cluster.
Configure Your Source Fields.
Choose whether to Create HyperScale Vector Index (now) or Create HyperScale Vector Index (later).
Choose Your Embedding Model.
Verify your workflow configuration.
Click Run Workflow.

Do not delete or modify the metadata scope, collections, or Eventing functions created by your new Workflow. If you modify or delete the metadata or functions, you must delete your Workflow and create a new one.
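
If you generate workflow names in a script, you can check them against the naming rule from the Workflow Name step before you enter them. The following is a minimal sketch. The function name is hypothetical, and the minimum length of one character is an assumption, because the rule above states only the maximum.

```python
import re

# Documented rule: at most 128 characters, using only letters (A-Z, a-z),
# numbers (0-9), dashes (-), and underscores (_).
# Assumption: a name must contain at least one character.
WORKFLOW_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,128}")


def is_valid_workflow_name(name: str) -> bool:
    """Return True if the whole name satisfies the documented rule."""
    return WORKFLOW_NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_workflow_name("hotel-reviews_v2")` returns `True`, while a name that contains a space or has 129 characters returns `False`.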

Configure Your Source Fields

Choose whether the Vectorization Service maps all of the source fields in your documents to a single vector field, or uses custom source field mappings that you define.

Map all source fields to a single vector field
Create custom source field mappings

To map all of the fields in your documents to a single vector field:

Click Map all source fields to a single vector field.
(Optional) Under Vector Field, enter a name for the field where you want to store your vectors.
Continue with the rest of the Procedure.

To create custom mappings and only vectorize specific fields from your documents:

Click Create custom source field mappings.
Under Source Fields, click the list.
Select every field from your source documents that you want to vectorize and store in a single field.
In the corresponding field under Vector Field, enter a name for the new vector field for your selected source field or fields.
(Optional) To map additional fields to another vector field, click Add more mapping and repeat Steps 2-4.
Continue with the rest of the Procedure.
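
To make the idea of a custom mapping concrete, the following sketch models it as a dictionary from each vector field name to the source fields that feed it. This only illustrates what a mapping expresses. This page does not describe how the Vectorization Service combines source fields internally, so the field names, the function names, and the `field: value` join format are all assumptions.

```python
from typing import Any

# Hypothetical mappings: vector field name -> source fields to vectorize.
FIELD_MAPPINGS: dict[str, list[str]] = {
    "description_vector": ["name", "description"],
    "review_vector": ["review"],
}


def text_for_vector_field(doc: dict[str, Any], source_fields: list[str]) -> str:
    """Build the text to embed for one vector field.

    Fields missing from the document are skipped. The join format is an
    assumption, not the service's documented behavior.
    """
    parts = [f"{field}: {doc[field]}" for field in source_fields if field in doc]
    return "\n".join(parts)


def plan_embeddings(doc: dict[str, Any], mappings: dict[str, list[str]]) -> dict[str, str]:
    """Return the text that would be embedded for each vector field."""
    return {
        vector_field: text_for_vector_field(doc, source_fields)
        for vector_field, source_fields in mappings.items()
    }
```

With these mappings, a document that also contains a `price` field would have `name` and `description` contribute to `description_vector` and `review` contribute to `review_vector`, while `price` is left out of both.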

Choose Your Embedding Model

You can vectorize your data with an embedding model hosted by either the Capella Model Service or OpenAI.

Use a Capella Model
Use an OpenAI Model

To use a Capella Model:

Click Capella Model.
Select the name of the model you want to use in this workflow.
Upload or manually enter your embedding model’s API Key ID and API Key Token. For more information about API keys for Capella models, see Get Started with AI Services APIs.
(Optional) Choose whether to set up Private Networking for your Capella embedding model. For more information about Private Networking for AI Services, see Add an AWS PrivateLink Connection.
Click Next.
Continue with the rest of the Procedure.

To use an OpenAI model:

Click External Model.
In the Choose OpenAI Model list, select the specific OpenAI model you want to use in this workflow.
(Optional) To use a new OpenAI API key, click Add New OpenAI API Key.
1. Enter a name to identify your API Key in Capella.
2. Enter your Secret Access Key from OpenAI.
3. Click Add Key.
In the Integrations Name list, select the OpenAI API Key you want to use.
Click Next.
Continue with the rest of the Procedure.

Workflows do not use the OpenAI Batch API.

Next Steps

The Capella AI Services UI shows the documents that have been processed by your Data from Capella Workflow. You can click the Failed icon to view error information for failed documents.

Data from Capella Workflows are displayed with a Type of Capella.

For more information about Workflow statuses, see Workflow Statuses.

You can also: