Process Your Data For Capella AI Services
Use Capella AI Services Workflows to prepare, process, and vectorize text for use with other Capella AI Services.
The following Workflows are available through Capella AI Services:
- Unstructured Data from S3: Use when your data is not yet in JSON format and is stored in an Amazon S3 bucket.
- Structured Data from External Sources: Use when your data has already been preprocessed and stored in JSON format in an Amazon S3 bucket.
- Data from Capella: Use when your data has already been preprocessed and stored in a Capella operational cluster.
| All Workflows require an embedding model to generate vectors. You can use an embedding model hosted by the Capella Model Service or OpenAI. |
You must configure specific data preprocessing options when using an Unstructured Data Workflow. Keep the Unstructured Data Processing Limitations in mind when processing unstructured data.
Vectorization
Using the Data Processing Service, Capella AI Services preprocesses and chunks your data, removing elements that add noise to improve your signal-to-noise ratio, and converts your data into JSON. It then uses the Vectorization Service to generate vector embeddings from the processed JSON data.
Vectors are a numerical representation of complex unstructured data. They distill this complex data into an array of floating-point values called dimensions. The dimensions in vectors capture features of the data in a way that makes them easy to compare mathematically.
For more information about vectors, see About Vectors and Embedding Vectors.
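For example, a common way to compare vectors is cosine similarity, which scores how closely the directions of two vectors align. The following sketch uses toy 4-dimensional vectors to illustrate the idea; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
# Toy example: comparing two embedding vectors with cosine similarity.
# The 4-dimensional values are illustrative, not real model output.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_vector = [0.12, -0.45, 0.88, 0.03]
query_vector = [0.10, -0.40, 0.90, 0.00]

# Values close to 1.0 mean the vectors point in similar directions,
# which indicates semantically similar content.
print(cosine_similarity(doc_vector, query_vector))
```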
Workflows create dedicated metadata collections and Eventing functions on your chosen Capella operational cluster:
- A Workflow’s metadata is stored in collections inside the vectorization-meta-data scope.
- Every Workflow creates 2 Eventing functions:
  - vec_ctr_$WORKFLOW_ID
  - vec_wkr_$WORKFLOW_ID
| Do not modify or delete the Workflows metadata scope, its collections, or the Eventing functions. If you delete or modify the scope, collections, or Eventing functions, you must delete your Workflow and create a new one. |
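If you want to confirm that a Workflow has created its metadata, one option is to list the scopes and collections on your cluster. The following sketch uses the Couchbase Python SDK; the connection string, credentials, and bucket name are placeholders for your own values.

```python
# Sketch: list the collections in the Workflow metadata scope.
# Connection string, credentials, and bucket name are placeholders.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster.connect(
    "couchbases://cb.your-endpoint.cloud.couchbase.com",
    ClusterOptions(PasswordAuthenticator("username", "password")),
)
bucket = cluster.bucket("your-bucket")

for scope in bucket.collections().get_all_scopes():
    if scope.name == "vectorization-meta-data":
        for collection in scope.collections:
            print(f"{scope.name}.{collection.name}")
```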
Data Preprocessing Options
When using an Unstructured Data Workflow, you can configure data preprocessing options to control and fine-tune your data before generating vector embeddings.
You can configure the following options:
- (Optional) Page ranges to include from documents
- (Optional) Specific document elements to remove
- (Optional) Optical Character Recognition (OCR) for processing text from images or PDFs
Page Range
You can set an inclusive range of pages for any documents processed through your Workflow.
Capella ignores any content on document pages not included in this range.
Layout Exclusions
You can choose specific document elements to remove from processing through your Workflow.
These elements of unstructured documents can add noise to your vectorization process without contributing meaningful semantic content. Removing them improves the signal-to-noise ratio and reduces the storage and processing requirements for vectorizing your data.
You can remove the following document elements from processing through an Unstructured Data Workflow:
- Tables
- Footers
- Document titles and section headers
By default, your Unstructured Data Workflow enables Exclude Footer and Exclude Header.
Optical Character Recognition (OCR)
You can enable OCR to extract text from JPEGs, PNGs, and PDFs.
If you do not have images or PDF files in your data, you do not need to enable OCR.
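Taken together, the preprocessing options for a Workflow might be expressed like the following sketch. The field names here are hypothetical, used only to illustrate the options described in this section; they do not reflect the actual Capella API schema.

```python
# Hypothetical preprocessing options for an Unstructured Data Workflow.
# Field names are illustrative only, not the actual Capella API schema.
preprocessing_options = {
    "page_range": {"start": 2, "end": 40},  # inclusive; other pages ignored
    "exclusions": {
        "tables": False,
        "footers": True,   # Exclude Footer is enabled by default
        "headers": True,   # Exclude Header is enabled by default
    },
    "ocr": True,           # extract text from JPEGs, PNGs, and PDFs
}
```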
Chunking
You can configure how your Unstructured Data Workflow chunks, or divides, your documents into smaller segments. Chunks make it easier for a model to search and retrieve only the relevant information from your documents.
You can choose a chunking strategy, maximum chunk size, and chunk overlap.
Chunk size can be set to a minimum of 256 tokens and a maximum of 8192 tokens.
If you set a chunk overlap, your Unstructured Data Workflow overlaps 0-4 tokens between consecutive chunks.
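The following sketch restates these constraints as simple validation logic. The limits come from this page; the function itself is only illustrative.

```python
# Illustrative validation of the documented chunking limits.
def validate_chunk_settings(max_tokens: int, overlap_tokens: int) -> None:
    if not 256 <= max_tokens <= 8192:
        raise ValueError("Chunk size must be between 256 and 8192 tokens.")
    if not 0 <= overlap_tokens <= 4:
        raise ValueError("Chunk overlap must be between 0 and 4 tokens.")
```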
Chunking Strategies
The following chunking strategies are available in Capella AI Services:
| Chunking Strategy | Semantic Preservation | Common Use Cases |
|---|---|---|
| Fixed-length (text) | Low semantic preservation that ignores document structure and can break natural semantic boundaries. | Summarizing and searching extensive log files. |
| Paragraph | High semantic preservation by preserving the broader contextual reasoning and narrative flow. | Finding information in large, complex documents. |
| Recursive | Medium-to-high semantic preservation that preserves meaning within chunks, particularly for highly structured or hierarchical documents. | Applications that need to understand document structure and relationships, such as legal documents. |
| Semantic | High semantic preservation that preserves semantic meaning within chunks, by comparing text segments for semantic similarity and creating new chunks at logical breaks. Content is separated into chunks based on themes, rather than just structural elements, making comprehensive information retrieval easier. | RAG applications that need coherent, theme-based retrieval. |
| Sentence | High semantic preservation by preserving precise context in sentences, but may struggle with inter-sentence references. | Precise retrieval of specific details, such as customer support responses. |
Fixed-length (text) chunking
Divides text into uniform chunks based on the chosen maximum chunk size. This method can help with consistent processing across large documentation sets, even when individual sections vary greatly in length. However, it can break semantic units and is unsuitable for preserving semantic coherence.
For example, you could split a document into chunks of 100 tokens, regardless of whether the chunk ends mid-sentence or across paragraphs. This option is useful for summarizing and searching extensive log files.
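A minimal sketch of the idea, using whitespace-separated words as a stand-in for real tokens (the tokenizer Capella actually uses is not described here):

```python
# Naive fixed-length chunking: split on whitespace as a stand-in for
# real tokenization, then emit uniform chunks of at most max_tokens words.
def fixed_length_chunks(text: str, max_tokens: int = 100) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```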
Paragraph chunking
Divides text into full paragraphs. This method preserves paragraph integrity and can maintain context within logical text units.
For example, paragraph chunking might break a document into chunks where each has several paragraphs, maintaining the semantic integrity of each chunk while keeping the size manageable. However, there can be variation in chunk sizes since paragraphs can vary greatly in length. Preserving context and reasoning within each paragraph is useful for applications like finding information in large, complex documents.
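A minimal sketch, again using word counts as a stand-in for tokens, that packs whole paragraphs into chunks without exceeding a budget:

```python
# Naive paragraph chunking: split on blank lines, then group whole
# paragraphs into chunks that stay within the word budget.
def paragraph_chunks(text: str, max_tokens: int = 256) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```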
Recursive chunking
Divides text into increasingly smaller segments based on document structure and semantic units. This method preserves document hierarchy by splitting at natural boundaries like paragraphs, then sentences, and then words, using a top-down approach.
For example, recursive chunking might break a document first into paragraphs. If a paragraph exceeds the chunk size, it moves to the next level by breaking it into sentences. This process continues down to individual words if necessary. This hierarchical approach maintains context between chunks while keeping individual segments at a manageable size for processing. This approach is useful for applications that need to understand document structure and relationships, such as legal documents.
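A minimal sketch of the top-down approach, descending from paragraphs to sentences to words only when a segment exceeds the budget:

```python
import re

# Naive recursive chunking: try natural boundaries first (paragraphs,
# then sentences), and only fall back to word splits when needed.
def recursive_chunks(text: str, max_tokens: int = 256) -> list[str]:
    if len(text.split()) <= max_tokens:
        return [text.strip()] if text.strip() else []
    for pattern in (r"\n\n+", r"(?<=[.!?])\s+"):
        parts = re.split(pattern, text)
        if len(parts) > 1:
            chunks: list[str] = []
            for part in parts:
                chunks.extend(recursive_chunks(part, max_tokens))
            return chunks
    words = text.split()  # last resort: split on words
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```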
Semantic chunking
Divides text into smaller, self-contained units based on meaning and context. This method uses embedding models to create mathematical representations of text segments, comparing their semantic similarity to find logical breaks for new chunks.
Semantic chunking makes sure each chunk conveys a unified idea and preserves coherent context for RAG applications. Its algorithm uses your configured chunk size and chunk overlap to help determine where to create breaks in the content.
Semantic chunking is the default option for Capella AI Services, as it can provide the best results.
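The sketch below illustrates the general technique only. The embed() function is a placeholder for a call to your chosen embedding model, and the similarity threshold is an arbitrary illustrative value; Capella's actual algorithm also factors in your configured chunk size and chunk overlap.

```python
import math
import re

def embed(sentence: str) -> list[float]:
    # Placeholder for a call to your chosen embedding model.
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    )

# Naive semantic chunking: start a new chunk wherever the similarity
# between neighbouring sentences drops below an illustrative threshold.
def semantic_chunks(text: str, threshold: float = 0.8) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:  # logical break
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```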
Sentence chunking
Divides text into full sentences, ensuring each chunk contains a complete thought. This method helps preserve the logical flow of information by splitting at natural sentence boundaries.
For example, sentence-level chunking breaks a document into chunks that contain 1 complete sentence, maintaining the semantic integrity of each chunk while keeping the size manageable. This approach allows for precise retrieval of specific details, which is crucial for applications like accurate customer support responses.
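A minimal sketch using a naive regex split; real sentence segmentation must also handle abbreviations and decimals, so treat this as illustrative only.

```python
import re

# Naive sentence chunking: one complete sentence per chunk, split on
# sentence-ending punctuation followed by whitespace.
def sentence_chunks(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```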
Unstructured Data Processing Limitations
Unstructured Data Workflows have the following file size, file count, and image size limitations:
| Limitation | Value |
|---|---|
| Maximum file size | 100 MB |
| Maximum number of files allowed per Workflow | 10,000 |
| Maximum image size | 32,767 pixels |
Workflow Statuses
A Capella AI Services Workflow can have 1 of the following statuses:
| Status | Description |
|---|---|
| Deploying | The resources for this Workflow are currently being deployed on AI Services. |
| Deploy Failed | The required resources for this Workflow failed to deploy. You can delete this Workflow and try to deploy a new one. |
| Pending | The deployment process for the Workflow is taking longer than expected. |
| Running | The Workflow is currently processing documents. You can stop the Workflow while it’s running from the Workflows page. |
| Completed | The Workflow has finished processing documents. A Workflow can change to the Completed status even if it has processed only a single document. You can rerun the Workflow from the Workflows page. |
| Failed | The Workflow failed during processing. You can view more information about the error and rerun the Workflow from the Workflows page. |
| Stopping | The Workflow is finishing processing any currently ingested documents and preparing to stop. |
| Stopped | The Workflow has stopped processing documents. You can rerun the Workflow from the Workflows page. |
| Stop Failed | An error occurred while stopping the Workflow. Contact Couchbase Capella Support to delete the Workflow. |
| Destroying | The Workflow’s resources are being deleted. |
| Destroy Failed | An error occurred while deleting the Workflow. Contact Couchbase Capella Support. |
Workflow Billing
For information about how Couchbase bills you according to your Workflow usage, see Workflow Billing.