Process and Vectorize Unstructured Data
Use a Capella AI Services Unstructured Data Workflow to automatically preprocess data for a Retrieval Augmented Generation (RAG) application or other use cases inside Capella. Convert your data into JSON from PDFs, JPGs, PNGs, and DOC or DOCX files and generate vector embeddings, all in one Workflow.
| Capella can convert only JPG and PNG images of text to JSON data. Images that do not contain text cannot be converted by a Workflow. Make sure image files do not exceed the maximum image file size. |
Workflows use your choice of embedding model to generate JSON data and vector embeddings, along with a Vector Search index, based on data stored in an Amazon S3 bucket. To generate your embeddings, you can use a model hosted by the Capella Model Service or OpenAI. Capella stores the generated JSON data, vector embeddings, and Vector Search index in an operational cluster.
To process your data effectively, you must choose a chunking strategy for your text. For more information, see Chunking.
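For example, a fixed-size chunking strategy splits extracted text into pieces of a set maximum length, with some overlap between neighboring chunks so that context is not lost at the boundaries. The following is a minimal sketch of that idea in Python; the chunk_size and overlap values are illustrative only, not Capella defaults, and the strategy you choose in the Workflow may split by tokens or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks that overlap their neighbors.

    Illustrative only: chunk_size and overlap are example values, and a real
    Workflow chunks according to the strategy you select.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks


# A 2,000-character document with 512-character chunks and 64 characters
# of overlap produces 5 chunks.
print(len(chunk_text("x" * 2000)))
```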
| If you make any changes to the data inside your Amazon S3 bucket, such as adding or removing files, you must manually trigger the Unstructured Data Workflow again to process these changes. |
Prerequisites
- You have an Amazon S3 bucket that contains data in 1 of the following formats: PDF, JPG, PNG, DOC, or DOCX.
- Your Amazon S3 bucket contains no more than 10,000 files, and no single file is larger than 100 MB. For one way to check these limits before you start, see the sketch after this list.
- You have read-only credentials for your Amazon S3 bucket. For more information about AWS access keys, see the AWS documentation.
- If you want to use a model hosted on Capella, you must have:
  - Deployed a Capella embedding model. For more information, see Deploy an Embedding Model.
  - Your model’s API Key ID and API Key Token. For more information about API keys for Capella models, see Get Started with AI Services APIs.
- If you want to use a model hosted by OpenAI, you have your OpenAI API Key. For more information about how to find your OpenAI API Key, see the OpenAI Help Center.
- You know the chunking strategy you want to use to process your data. For more information, see Chunking.
- You have created an operational cluster in Capella that has the following:
  - Couchbase Server version 8.0 or later.
  - The Search Service and Eventing Service running on at least 1 Service Group. For more information, see Services and Service Groups.
  - A bucket that can store your Vector Search index and any generated vector embeddings.
    Use any bucket settings you would prefer for your particular use case. For more information, see Manage Buckets.
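If you want to confirm the file count, file sizes, and file formats in your S3 bucket before creating a Workflow, a short script along the following lines can help. It is a minimal sketch that assumes the boto3 library and credentials that can list the bucket; the bucket name is a placeholder.

```python
import boto3

ALLOWED_SUFFIXES = (".pdf", ".jpg", ".png", ".doc", ".docx")
MAX_FILES = 10_000
MAX_SIZE_BYTES = 100 * 1024 * 1024  # 100 MB

# boto3 reads credentials from your environment or ~/.aws/credentials.
# Use a read-only key pair. The bucket name below is a placeholder.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total, too_large, unsupported = 0, [], []
for page in paginator.paginate(Bucket="your-unstructured-data-bucket"):
    for obj in page.get("Contents", []):
        total += 1
        if obj["Size"] > MAX_SIZE_BYTES:
            too_large.append(obj["Key"])
        if not obj["Key"].lower().endswith(ALLOWED_SUFFIXES):
            unsupported.append(obj["Key"])

print(f"{total} files (limit {MAX_FILES})")
print(f"{len(too_large)} files over 100 MB")
print(f"{len(unsupported)} files with unsupported extensions")
```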
Procedure
To create a new Unstructured Data Workflow and process unstructured data in Capella:
- Go to .
- Click Create New Workflow.
- Click Unstructured Data from External sources.
- In the Workflow Name field, enter a name to identify your Unstructured Data Workflow, or accept the automatically generated name.
  Workflow names can be a maximum of 128 characters and can include letters (A-Z, a-z), numbers (0-9), dashes (-), and underscores (_).
- Click Start Workflow.
- Choose whether to Create HyperScale Vector Index now or later.
- Under Destination Cluster, in the Destination Operational Cluster list, select the cluster you configured in the Prerequisites.
- Set the Destination Bucket, Destination Scope, and Destination Collection for your vector embeddings. For one way to inspect the documents written to this collection after the Workflow runs, see the sketch after this procedure.
- Verify your workflow configuration.
- Click Run Workflow.
| Do not delete or modify the metadata scope, collections, or Eventing functions created by your new Workflow. If you modify or delete the metadata or functions, you must delete your Workflow and create a new one. |
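After the Workflow finishes, the generated JSON documents and their vector embeddings live in the destination bucket, scope, and collection you selected. As a rough sketch, you could read one back with the Couchbase Python SDK as shown below; the connection string, credentials, destination names, and document ID are placeholders, and the structure of the generated documents is determined by the Workflow, not by this example.

```python
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Placeholder connection details for your Capella operational cluster.
cluster = Cluster(
    "couchbases://cb.example.cloud.couchbase.com",
    ClusterOptions(PasswordAuthenticator("db-user", "db-password")),
)
cluster.wait_until_ready(timedelta(seconds=10))

# The destination bucket, scope, and collection you chose in the Workflow.
collection = (
    cluster.bucket("destination-bucket")
    .scope("destination-scope")
    .collection("destination-collection")
)

# Hypothetical document ID; use a key you see in the collection.
doc = collection.get("example-chunk-id")
print(doc.content_as[dict])
```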
Configure Your Amazon S3 Bucket
Choose whether to add a new Amazon S3 bucket integration or use an S3 bucket that you have already saved as an integration with Capella AI Services.
| You can manage your saved Amazon S3 bucket credentials from the Integrations page. |
- New Amazon S3 Bucket
- Use Existing Amazon S3 Bucket
To configure a new Amazon S3 bucket:
- Click Add New S3 Bucket Integration.
- In the Integration Name field, enter a name to identify your credentials and make them easier to manage from the Integrations page.
- Enter the details and credentials for accessing your Amazon S3 bucket.
  It’s recommended to use read-only credentials for your S3 bucket. Make sure you have your Access Key ID and its Secret Access Key.
  You can also choose to use temporary credentials backed by a session token. For more information about configuring temporary credentials and session tokens, see the AWS documentation. For one way to generate temporary credentials, see the sketch after these steps.
- Click Add Credentials.
- In the S3 Bucket Integration list, select your new S3 bucket.
- Verify your S3 Integration Summary.
- Continue with the rest of the Procedure.
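If you prefer to give Capella temporary credentials rather than a long-lived key pair, one way to mint them is with AWS STS, as sketched below. This assumes the boto3 library and an existing AWS identity that can call GetSessionToken; supply the resulting access key ID, secret access key, and session token when you add the integration, and repeat the step when the credentials expire.

```python
import boto3

# Assumes your default AWS credentials are allowed to call sts:GetSessionToken.
sts = boto3.client("sts")
creds = sts.get_session_token(DurationSeconds=3600)["Credentials"]

print("Access Key ID:    ", creds["AccessKeyId"])
print("Secret Access Key:", creds["SecretAccessKey"])
print("Session Token:    ", creds["SessionToken"])
print("Expires:          ", creds["Expiration"])
```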
To use an existing Amazon S3 bucket that you added to Capella AI Services:
- In the S3 Bucket Integration list, select the S3 bucket where your unstructured data is stored.
- Verify your S3 Integration Summary.
- Continue with the rest of the Procedure.
Configure Your Data Preprocessing Settings
Choose the specific settings for processing your unstructured data:
- (Optional) To set a specific inclusive range of document pages to process in your Workflow, turn on Include Page Range.
  - Using the Start Page and End Page fields, configure your inclusive page range.
    The page range must be valid for all documents stored in your S3 bucket.
- (Optional) In the Layout Exclusions list, select the specific page layout elements that you want to exclude from vectorization.
  For example, you could choose to exclude anything identified as a Header or Footer element from your workflow.
- (Optional) If your documents include PNGs, JPGs, or PDFs, turn on Enable OCR to extract the text from these files.
- Choose the chunking strategy, maximum chunk size, and chunk overlap for vectorizing your data.
  For more information about chunking strategies, see Chunking.
- Click Next.
- Continue with the rest of the Procedure.
Choose Your Embedding Model
You can choose to use an embedding model hosted by the Capella Model Service or hosted by OpenAI to vectorize your data.
- Use a Capella Model
- Use an OpenAI Model
To use a Capella Model:
- Click Capella Model.
- Select the name of the model you want to use in this workflow.
- Upload or manually enter your embedding model’s API Key ID and API Key Token. For more information about API keys for Capella models, see Get Started with AI Services APIs.
- (Optional) Choose whether to set up Private Networking for your Capella embedding model. For more information about Private Networking for AI Services, see Add an AWS PrivateLink Connection.
- Click Next.
- Continue with the rest of the Procedure.
To use an OpenAI model:
- Click External Model.
- In the Choose OpenAI Model list, select the specific OpenAI model you want to use in this workflow.
- (Optional) To use a new OpenAI API key, click Add New OpenAI API Key.
  - Enter a name to identify your API Key in Capella.
  - Enter your Secret Access Key from OpenAI.
  - Click Add Key.
- In the Integrations Name list, select the OpenAI API Key you want to use. To check that your key and chosen model accept embedding requests before running the Workflow, see the sketch after these steps.
- Click Next.
- Continue with the rest of the Procedure.
Workflows do not use the OpenAI Batch API.
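Before running the Workflow, you can confirm that your OpenAI API key and chosen model accept embedding requests with a quick call like the one below. It is a minimal sketch that uses the official openai Python package; the key and model name are placeholders for your own values, and the Workflow itself makes its embedding calls for you without the Batch API.

```python
from openai import OpenAI

# Use the same secret key you saved as an integration in Capella.
client = OpenAI(api_key="sk-...")  # placeholder key

response = client.embeddings.create(
    model="text-embedding-3-small",  # example model; use the one you selected
    input=["A short sentence to embed as a sanity check."],
)
print(len(response.data[0].embedding))  # embedding dimensionality
```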
Next Steps
The Capella AI Services UI shows the documents that have been processed by your Unstructured Data Workflow. You can click the Failed icon to view error information for failed documents.
Unstructured Data Workflows display with a Type of S3 - Unstructured.
For more information about Workflow statuses, see Workflow Statuses.
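Once your Workflow has built its index, a RAG application can retrieve the chunks closest to a query embedding. The sketch below shows one way to query a Search Service vector index with the Couchbase Python SDK; the index name, vector field name, and query vector are placeholders to replace with values from your own Workflow, `cluster` is a connected Cluster object as in the earlier sketch, and a Hyperscale Vector Index may be queried differently.

```python
import couchbase.search as search
from couchbase.vector_search import VectorQuery, VectorSearch

# Placeholder: the embedding of your user's question, produced with the same
# model the Workflow used (Capella-hosted or OpenAI). Real vectors have the
# model's full dimensionality.
query_vector = [0.01, -0.02, 0.03]

request = search.SearchRequest.create(
    VectorSearch.from_vector_query(
        # Placeholder field name; check the index the Workflow created.
        VectorQuery.create("embedding", query_vector, num_candidates=5)
    )
)

# Placeholder index name; `cluster` is a connected Cluster as shown earlier.
result = cluster.search("generated-vector-index", request)
for row in result.rows():
    print(row.id, row.score)
```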
You can also: