Filtered Search Using Composite Vector Indexes

A Composite Vector index is a Global Secondary Index (GSI) with a single vector column that combines scalar queries with semantic search. The added vector column lets your application use the index’s scalar, array, and object index entries to pre-filter the dataset before performing a vector similarity search.

How the Composite Vector Index’s Vector Column Works

The Composite Vector index’s single vector column enables semantic and similarity searches within your SQL++ queries. When creating the index, you use a VECTOR key attribute to identify the key that contains the embedded vectors.

When your query contains an embedded vector, the Index Service uses any non-vector predicates in the query to filter index entries. Then it performs a vector similarity search to locate semantically related vectors. Handling the non-vector predicates first reduces the number of vector similarity comparisons the Index Service must do to find similar vectors.
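The filter-then-compare order described above can be sketched in plain Python. This is only an illustration of the idea, not the Index Service's implementation. The `entries` list and the `filtered_search` helper are hypothetical names, and the rows reuse color values that appear in the query results later on this page.

```python
import math

# Hypothetical in-memory stand-in for index entries: (color, brightness, vector).
entries = [
    ("grey", 128.0, [128, 128, 128]),
    ("light gray", 144.0, [144, 144, 144]),
    ("dim gray", 105.0, [105, 105, 105]),
    ("rosy brown", 156.455, [188, 143, 143]),
]

def filtered_search(entries, min_brightness, query_vector, limit):
    """Apply the scalar predicate first, then rank the survivors by L2 distance."""
    # Step 1: the non-vector predicate removes entries before any vector math.
    candidates = [e for e in entries if e[1] > min_brightness]
    # Step 2: only the remaining vectors are compared with the query vector.
    ranked = sorted(candidates, key=lambda e: math.dist(e[2], query_vector))
    return [e[0] for e in ranked[:limit]], len(candidates)

names, compared = filtered_search(entries, 128, [128, 128, 128], 2)
print(names, compared)
```

In this sketch only two of the four entries reach the distance calculation, which is the saving the pre-filter is meant to provide.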

Prerequisites

You must have the Index Service enabled on at least one node in your cluster. For more information about how to deploy a new node and Services on your database, see Manage Nodes and Clusters.
You must have a bucket with scopes and collections in your database. For more information about how to create a bucket, see Create a Bucket.
Your account must have the Query Manage Index or an administrator role to be able to create an index.
You have documents in a collection that contain one or more vector embeddings. You can add a single vector to a Composite Vector index. If your documents contain multiple embedded vectors, you can create multiple indexes — one for each vector attribute.

Embeddings can be an array of floating point numbers or a base64 encoded string. Couchbase Server does not embed vectors itself. You must use an external embedding model to embed vectors into your data and add them to your documents.
You must know the number of dimensions the vector contains. The embedding model you use to embed the vectors may determine this value for you. For example, OpenAI’s text-embedding-ada-002 embedding model, which produced the embeddings in the sample data used later on this page, creates vectors that have 1536 dimensions.
You must decide what distance metric and quantization you want your index to use. The metric affects how the index compares vectors. The quantization determines how much memory your index uses and how much processing Couchbase Server must perform to train and search the index. See Vector Similarity Metrics and Quantization for more information.

Examples on this Page

You can download a sample dataset to use with the procedure or examples on this page:

Download color_data_2vectors.zip

To get the best results when using the sample data with the examples in this documentation, import the sample files from the dataset into your database with the following settings:

Use a bucket called vector-sample.
Use a scope called color.
Use a collection called rgb for rgb.json.
To set your document keys, use the value of the id field from each JSON document.

Create a Composite Vector Index

Creating a Composite Vector index is similar to creating a non-vector GSI. See Create Indexes for an overview of creating indexes. In the CREATE INDEX statement that creates the Composite Vector index, add the VECTOR key attribute after the vector’s key name to declare it as an embedded vector.

The index key that refers to a vector field may be the only index key. If there are multiple index keys, the index key referring to the vector field may be any of the index keys, including the leading index key.

You must also use the WITH clause to specify some additional information for the vector column. The format for this clause with the most commonly used parameters is:

WITH {"dimension": <dimensions>,
      "similarity": <similarity_metric>,
      "description": <centroids_and_quantization>
     };

The WITH clause can contain other parameters that affect how the index processes vectors. For a full list of the parameters that affect a Composite Vector index, see CREATE INDEX in the SQL++ for Query Reference.

dimensions is an integer value that sets the number of dimensions in the vector. This value is set by the embedding model you used to embed the vectors.
similarity_metric is a string that sets the distance metric to use when comparing vectors during index creation.

Couchbase Server uses the following strings to represent the distance metrics:
- COSINE: Cosine Similarity
- DOT: Dot Product
- L2 or EUCLIDEAN: Euclidean Distance
- L2_SQUARED or EUCLIDEAN_SQUARED: Euclidean Squared Distance
For the greatest accuracy, use the distance function you plan to use when querying vector data.
centroids_and_quantization is a string containing the settings for the quantization and index algorithms. See Quantization and Centroid Settings in the next section for more information.
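As a rough illustration of how the three parameters fit together, the following Python sketch assembles a CREATE INDEX statement as a string. The `create_index_statement` helper is hypothetical, and the values passed to it are taken from the first index example later on this page. It performs no escaping or validation, so treat it as a way to see the shape of the statement rather than as production code.

```python
import json

def create_index_statement(index_name, keyspace, keys, dimension, similarity, description):
    """Return a CREATE INDEX string with the three common vector settings."""
    # json.dumps produces the JSON object that follows the WITH keyword.
    with_clause = json.dumps(
        {"dimension": dimension, "similarity": similarity, "description": description}
    )
    return f"CREATE INDEX `{index_name}` ON {keyspace} ({', '.join(keys)}) WITH {with_clause};"

stmt = create_index_statement(
    "color_vectors_idx",
    "`vector-sample`.`color`.`rgb`",
    ["`colorvect_l2` VECTOR", "color", "brightness"],
    3,
    "L2",
    "IVF,SQ8",
)
print(stmt)
```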

Quantization and Centroid Settings

When creating an index that includes a vector field, you choose settings that affect how the index processes vectors. The parameter named description is the primary setting that controls the quantization and the number of centroids Couchbase Server uses to create the index. Using it, you control how the index subdivides the dataset to improve performance and how it quantizes vectors to reduce memory and processing requirements.

The description parameter is a string in the following format:

description ::= '"' 'IVF' number-of-centroids? ',' ('SQ' sq-settings | 'PQ' pq-settings) '"'

The following sections describe the settings for centroids and quantization.
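To make the format concrete, here is a minimal Python sketch that checks a description string against the grammar above and splits it into its parts. The `parse_description` helper and its regular expression are assumptions written for this page. They only check the shape of the string, not whether Couchbase Server supports the specific values, and the server's own parser may behave differently.

```python
import re

# Shape from the grammar above: IVF, optional centroid count, a comma,
# then either SQ<bits> or PQ<subquantizers>x<bits>.
DESCRIPTION_RE = re.compile(r"IVF(\d*),(?:SQ(\d+)|PQ(\d+)x(\d+))")

def parse_description(description):
    """Split a description string into centroid and quantization settings."""
    match = DESCRIPTION_RE.fullmatch(description)
    if match is None:
        raise ValueError(f"not a valid description: {description!r}")
    centroids, sq_bits, pq_subquantizers, pq_bits = match.groups()
    # An empty centroid count means Couchbase Server chooses the value.
    parsed = {"centroids": int(centroids) if centroids else None}
    if sq_bits is not None:
        parsed["quantization"] = ("SQ", int(sq_bits))
    else:
        parsed["quantization"] = ("PQ", int(pq_subquantizers), int(pq_bits))
    return parsed

print(parse_description("IVF,SQ8"))
print(parse_description("IVF1024,PQ8x8"))
```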

Number of Centroids

A Composite Vector index uses several algorithms to organize its data to improve its performance. One of these algorithms, Inverted File (IVF), has a setting you can adjust to control how it subdivides the dataset. The other algorithms that a Composite Vector index uses do not have settings you can adjust.

The key setting for IVF is the number of centroids it allocates for the index. This setting controls how large the centroids are. Larger centroids have more vectors associated with them.

You can have Couchbase Server choose a number of centroids for you by not providing a value after the IVF in your description parameter. It sets the number of centroids to the number of vectors in the dataset divided by 1000.

You can manually set the number of centroids for the index by adding an integer value after the IVF in the description parameter. The number of centroids you set manually must be less than the number of vectors in the dataset.
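The two rules above amount to simple arithmetic, sketched here in Python. The function names are hypothetical. The use of integer division is an assumption, because this page only says "divided by 1000" and does not describe rounding or how very small datasets are handled.

```python
def default_centroids(num_vectors):
    """Centroid count when no value follows IVF: vectors divided by 1000."""
    # Assumption: integer division. Rounding and any minimum the server
    # applies to small datasets are not described on this page.
    return num_vectors // 1000

def is_valid_manual_centroids(centroids, num_vectors):
    """A manually chosen count must be less than the number of vectors."""
    return 0 < centroids < num_vectors

print(default_centroids(2_000_000))
print(is_valid_manual_centroids(1024, 2_000_000))
print(is_valid_manual_centroids(1024, 500))
```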

The number of centroids affects the performance of your Composite Vector index in two ways:

If the index has fewer centroids, each centroid is larger (has more vectors associated with it). In this case, a vector search has to perform more comparisons, making the search slower. However, having fewer centroids decreases the processing required to train the index.
A greater number of centroids results in a greater processing cost for training. This increase is due to the training process having to search for more data clusters to identify more centroids. However, it reduces the number of vectors associated with each centroid. This reduction makes search faster by limiting the number of vector comparisons during a search.

You may need to experiment with different numbers of centroids to find the best setting for your dataset and queries.
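As a starting point for that experimentation, this short Python sketch shows how the average number of vectors per centroid changes with the centroid count. The dataset size of one million vectors is a made-up figure for illustration, and the calculation assumes vectors are spread evenly across centroids, which real data rarely does.

```python
def average_vectors_per_centroid(num_vectors, centroids):
    """Average centroid size, assuming an even spread of vectors."""
    return num_vectors / centroids

# Hypothetical dataset of 1,000,000 vectors.
for centroids in (100, 1000, 10000):
    print(centroids, average_vectors_per_centroid(1_000_000, centroids))
```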

See nList for more guidance on choosing the number of centroids for the index.

Quantization Setting

A Composite Vector index always uses quantization to reduce the size of vectors stored in the index. You must choose whether the index uses Scalar Quantization (SQ) or Product Quantization (PQ). See Quantization for guidance on choosing the quantization method for the index.

You select the quantization by adding a comma followed by either PQ or SQ after the IVF setting in the description value.

Each quantization method has additional settings explained in the following sections.

SQ Settings

For SQ, you set the number of bits the SQ algorithm uses for the bin index value, or the number of bits it uses to store the centroid for each bin. The values for SQ that Couchbase Server supports are:

Setting	Effect
`SQ4`	SQ uses a 4-bit index value splitting each vector dimension into 16 subspaces.
`SQ6`	SQ uses a 6-bit index value splitting each vector dimension into 64 subspaces.
`SQ8`	SQ uses an 8-bit index value splitting each vector dimension into 256 subspaces.

See Scalar Quantization for more information about how SQ works.
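The subspace counts in the table follow from the bit width, as this Python sketch shows. The per-vector size estimate is an assumption: it counts one code of the given width per dimension and ignores any packing, padding, or metadata the index adds. The dimension of 1536 matches the embedded vectors in the sample data.

```python
def sq_bins(bits):
    """Number of subspaces (bins) per dimension for a given bit width."""
    return 2 ** bits

def sq_bytes_per_vector(dimension, bits):
    """Rough quantized size of one vector, ignoring index overhead (assumption)."""
    return dimension * bits / 8

for setting, bits in (("SQ4", 4), ("SQ6", 6), ("SQ8", 8)):
    print(setting, sq_bins(bits), sq_bytes_per_vector(1536, bits))
```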

PQ Settings

If you choose to use PQ in your index, you must set two values:

The number of subquantizers (number of subspaces PQ splits the vector’s dimensions into) to use. This value must be a divisor of the number of dimensions in the vector. For example, if your vector has 99 dimensions, you can only use the values 3, 9, 11, 33, and 99 for the subquantizers. Using any other value returns an error.
The number of bits in the centroid’s index value. This value sets the number of centroids to find in each subspace. For example, setting this value to 8 has PQ store the index for the centroids in a byte. This results in PQ using 256 centroids per subspace.

The number of centroids you set using this value must be less than the number of vectors in the dataset. For example, if you choose 32 for the centroid index size, your dataset must have at least 4,294,967,296 vectors in it.

The larger you set either of these values, the more accurate the index’s search results are. The trade-off is that your index is larger, as it has to store data for more centroids. A smaller value results in a smaller index that returns less accurate results.
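Both constraints can be checked with a few lines of Python before you create the index. The helper names are hypothetical. The divisor list leaves out 1 so that it matches the values listed in the example above.

```python
def valid_subquantizers(dimension):
    """Subquantizer counts that divide the dimension evenly."""
    # 1 is omitted to match the list of values given in the text above.
    return [n for n in range(2, dimension + 1) if dimension % n == 0]

def centroids_per_subspace(bits):
    """Number of centroids PQ finds in each subspace for a given bit width."""
    return 2 ** bits

print(valid_subquantizers(99))
print(centroids_per_subspace(8))
print(centroids_per_subspace(32))
```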

The format for the PQ settings is:

pq-settings ::= subquantizers 'x' number-of-bits

For example, PQ32x8 has PQ break the vector’s dimensions into 32 subspaces, each of which has 256 centroids. See Product Quantization for more information about how PQ works.
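The following Python sketch works through a PQ setting for a given vector dimension. The `describe_pq` helper is hypothetical. The code-size figure is an estimate that counts only one centroid index per subspace and ignores the centroid tables and any other index overhead.

```python
def describe_pq(setting, dimension):
    """Break a setting such as PQ32x8 into the values it implies."""
    subquantizers, bits = (int(part) for part in setting[2:].split("x"))
    if dimension % subquantizers != 0:
        raise ValueError("subquantizers must be a divisor of the dimension")
    return {
        "subspaces": subquantizers,
        "dimensions_per_subspace": dimension // subquantizers,
        "centroids_per_subspace": 2 ** bits,
        # Assumption: one centroid index of `bits` bits per subspace, no overhead.
        "code_bytes_per_vector": subquantizers * bits / 8,
    }

info = describe_pq("PQ32x8", 1536)
print(info)
```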

Algorithm Settings Examples

The following table shows several description values along with an explanation.

Setting	Effect
`IVF,SQ8`	Couchbase Server chooses the number of centroids the IVF algorithm uses. The index uses Scalar Quantization with an 8-bit index, meaning it breaks each of the vector’s dimensions into 256 bins.
`IVF1024,PQ8x8`	IVF uses 1024 centroids to divide the dataset. The index uses Product Quantization. PQ breaks the vector space into 8 subspaces, each of which uses 8 bits to represent centroids in the subspace. This setting means each subspace has 256 centroids.

Examples

The following examples show you how to create Composite Vector indexes with a vector column using sample data. They all use the data from the color_data_2vectors.zip file mentioned earlier.

The following query gets a single document from the rgb collection in the vector-sample bucket’s color scope. It truncates the embedding_vector_dot attribute to the first four values to improve readability.

SELECT RAW OBJECT_PUT(d, "embedding_vector_dot",
  ARRAY_CONCAT(d.embedding_vector_dot[0:4], ["..."])
)
FROM `vector-sample`.`color`.`rgb` AS d
USE KEYS ["#FFEFD5"];

The result of running this query is:

[{
    "brightness": 240.82,
    "color": "papaya whip",
    "colorvect_l2": [
        255,
        239,
        213
    ],
    "description": "Papaya whip is a soft and mellow color that can be 
                   described as a light shade of peach or coral. It has 
                   a calming and soothing effect, similar to the tropical 
                   fruit it is named after. This color is perfect for 
                   creating a warm and inviting atmosphere, and it pairs 
                   well with other pastel shades or neutral tones. Papaya 
                   whip is a versatile color that can be used in both fashion 
                   and interior design, adding a touch of elegance and 
                   sophistication to any space.",
    "embedding_model": "text-embedding-ada-002-v2",
    "embedding_vector_dot": [
        -0.014644118957221508,
        0.017003899440169334,
        -0.013450744561851025,
        0.0021356006618589163,
        "..."
    ],
    "id": "#FFEFD5",
    "verbs": [
        "soften",
        "mellow",
        "lighten"
    ],
    "wheel_pos": "other"
}]

Index the RGB Values

The rgb.json file’s colorvect_l2 attribute defines an array containing the RGB values for the entry’s color. While this is not technically an embedded vector, you can still create a vector index column for this array. The following example creates a Composite Vector index that treats this attribute as an embedded vector and also indexes the color’s name and brightness.

CREATE INDEX `color_vectors_idx` ON `vector-sample`.`color`.`rgb`
       (`colorvect_l2` VECTOR, color, brightness)
       WITH {  "dimension":3 , "similarity":"L2", "description":"IVF,SQ8"};

In this example:

The number of dimensions is 3, because there are three values in the array containing the RGB value.
The similarity function is L2. This function works well for finding related vectors that are close to the search vector. In this example, finding similar colors depends more on proximity than on the magnitude or alignment of the vectors. See Vector Similarity for a comparison of similarity functions.
The description lets Couchbase Server decide the number of centroids for the IVF algorithm. It also selects Scalar Quantization with an 8-bit index, splitting each dimension into 256 bins. This setting does not actually save any space in the index, as each of the RGB dimensions is already an 8-bit value. However, in this example, memory use is not a concern as the dataset is small.
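To see why proximity suits RGB values better than alignment, the following Python sketch compares three measures for two vectors that point in the same direction as gray. It uses textbook definitions of Euclidean distance, dot product, and cosine similarity, and does not reproduce how Couchbase Server scores or orders results internally. The `near_white` vector is hypothetical and is included only for contrast.

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.dist(a, b)

def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

gray = [128, 128, 128]
light_gray = [144, 144, 144]   # value from the sample data
near_white = [250, 250, 250]   # hypothetical vector, used only for contrast

for name, vec in (("light gray", light_gray), ("near white", near_white)):
    print(name, round(l2(gray, vec), 2), dot(gray, vec), round(cosine(gray, vec), 6))
```

Both vectors have a cosine similarity of approximately 1 with gray because they point in the same direction, and the dot product is larger for the brighter vector. Only the Euclidean distance identifies light gray as the closer color.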

The result of running the example is:

[
  {
    "id": "f572fa0b1c7358ee",
    "name": "color_vectors_idx",
    "state": "online"
  }
]

Create a Composite Vector Index Using the Embedded Vectors

The embedding_vector_dot attribute contains the embedded vectors for the text in the description attribute. The data sample shown in Examples truncated this attribute to its first four values. The full embedded vector contains 1536 dimensions.

The following example creates a Composite Vector index that indexes the embedded vectors in the embedding_vector_dot attribute as well as the scalar color and brightness fields, which contain the color’s name and brightness.

CREATE INDEX `color_desc_idx` ON `vector-sample`.`color`.`rgb`
     (`embedding_vector_dot` VECTOR, color, brightness)
     WITH { "dimension":1536, "similarity":"DOT", "description":"IVF,SQ8" };

This example uses the Dot Product similarity function. This function works better with the embedded text content than the Euclidean function used in the previous example. It also uses the same algorithms as the previous example — Couchbase Server chooses the number of centroids, and uses SQ quantization with 256 bins.

If successful, Couchbase Server responds with:

[
  {
    "id": "c965205718c3e4c2",
    "name": "color_desc_idx",
    "state": "online"
  }
]

After Couchbase Server creates the index, it begins training it. Depending on your system, this training can take several seconds.

Create a Composite Vector Index with a Scalar Leading Key

If a Composite Vector index has multiple index keys, the leading index key may be a scalar field, rather than the vector field. This enables the Index Service to pre-filter using a scalar field. This is a key advantage that a Composite Vector index has over a Hyperscale Vector index with included scalar fields.

The following example creates a Composite Vector index that indexes the scalar color and brightness fields, as well as the embedded vectors in the embedding_vector_dot field.

CREATE INDEX `color_name_idx` ON `vector-sample`.`color`.`rgb`
     (color, brightness, `embedding_vector_dot` VECTOR)
     WITH { "dimension":1536, "similarity":"DOT", "description":"IVF,SQ8" };

If successful, Couchbase Server responds with:

[
  {
    "id": "45222612ffac4e98",
    "name": "color_name_idx",
    "state": "online"
  }
]

After Couchbase Server creates the index, it begins training it. Depending on your system, this training can take several seconds.

Query with a Composite Vector Index

You query embedded vector attributes to find similar vectors, and therefore similar semantic content. To find the most similar vectors, use an ORDER BY clause in your query to return the most relevant vectors first. In this clause, call one of the two functions that perform the vector comparisons: APPROX_VECTOR_DISTANCE or VECTOR_DISTANCE. The first of these functions is faster, but less precise. The second is more precise, but slower. Which you choose depends on your use case.

To use your Composite Vector index, use the SELECT statement with an ORDER BY clause containing a call to the APPROX_VECTOR_DISTANCE function. This function selects the Composite Vector index when you query the vector key that the index covers.

You should use a LIMIT clause to return just the number of vectors you need. The query pushes the LIMIT clause down into the index scan so that the scan ends after finding the required number of matches.

You can also perform vector comparisons using the VECTOR_DISTANCE function. This function does not select a Hyperscale Vector index or a Composite Vector index. Instead, it performs a brute-force comparison.

Query RGB Values

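Querying the RGB values in rgb.colorvect_l2 requires a vector with only three values, so you can specify the vector by hand. The queries in this section refer to the collection as rgb without a bucket or scope, so they assume your query context is set to the vector-sample bucket’s color scope. The following example finds colors that are similar to gray, which has an RGB value of 128, 128, 128: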

SELECT b.color, b.colorvect_l2, b.brightness from `rgb` AS b
ORDER BY APPROX_VECTOR_DISTANCE(b.colorvect_l2,[128,128,128],"L2") 
LIMIT 5;

The query uses the APPROX_VECTOR_DISTANCE function to sort the results. You pass it the vector column to search, the vector to search for (in this case, the array 128, 128, 128) and the distance function.

For the best accuracy, use the same distance function you specified when creating the Composite Vector index (in this case, L2). The query pushes the LIMIT clause down into the index scan, so once it finds the 5 entries that satisfy the query, it exits.

The top result is the entry for gray. The other results are all shades of gray:

[{
        "color": "grey",
        "colorvect_l2": [
            128,
            128,
            128
        ],
        "brightness": 128
    },
    {
        "color": "slate gray",
        "colorvect_l2": [
            112,
            128,
            144
        ],
        "brightness": 125.04
    },
    {
        "color": "light slate gray",
        "colorvect_l2": [
            119,
            136,
            153
        ],
        "brightness": 132.855
    },
    {
        "color": "light gray",
        "colorvect_l2": [
            144,
            144,
            144
        ],
        "brightness": 144
    },
    {
        "color": "dim gray",
        "colorvect_l2": [
            105,
            105,
            105
        ],
        "brightness": 105
    }
]
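For comparison, the following Python sketch computes the exact Euclidean distance from the search vector to each of the five results shown above. It uses only the values in that output and standard library math. It does not call Couchbase Server.

```python
import math

query = [128, 128, 128]

# Vectors copied from the query results shown above.
results = {
    "grey": [128, 128, 128],
    "slate gray": [112, 128, 144],
    "light slate gray": [119, 136, 153],
    "light gray": [144, 144, 144],
    "dim gray": [105, 105, 105],
}

exact = {color: math.dist(vec, query) for color, vec in results.items()}
exact_order = sorted(exact, key=exact.get)

for color in exact_order:
    print(f"{color}: {exact[color]:.2f}")
```

By exact distance, light gray (about 27.71) is marginally closer to the search vector than light slate gray (about 27.75), while the query output lists them the other way around. A small swap like this is consistent with APPROX_VECTOR_DISTANCE trading some precision for speed by comparing quantized vectors.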

You can also add other predicates to reduce the workload of searching for similar vectors by excluding vectors. The following example searches for colors that are similar to gray, which has an RGB value of 128, 128, 128, and that have a brightness greater than 128:

SELECT b.color, b.colorvect_l2, b.brightness from `rgb` AS b
WHERE b.brightness > 128 
ORDER BY APPROX_VECTOR_DISTANCE(b.colorvect_l2,[128,128,128],"L2") 
LIMIT 5;

The results of running this query are:

[{
        "color": "light slate gray",
        "colorvect_l2": [
            119,
            136,
            153
        ],
        "brightness": 132.855
    },
    {
        "color": "light gray",
        "colorvect_l2": [
            144,
            144,
            144
        ],
        "brightness": 144
    },
    {
        "color": "cadet blue",
        "colorvect_l2": [
            95,
            158,
            160
        ],
        "brightness": 139.391
    },
    {
        "color": "rosy brown",
        "colorvect_l2": [
            188,
            143,
            143
        ],
        "brightness": 156.455
    },
    {
        "color": "dark sea green",
        "colorvect_l2": [
            143,
            188,
            143
        ],
        "brightness": 169.415
    }
]

Query the Embedded Vectors

To query the color_desc_idx Composite Vector index containing the embedded vector for the description attribute, you must supply a vector. In a production environment, your application calls the same embedding model it called to generate the embedded vectors in your documents to generate a vector for the query value.

For this example, you can use embedded vectors in the rgb_questions.json file that’s in the color_data_2vectors.zip file. This file contains a question attribute containing a search prompt for a particular color. The following query gets a single document from the rgb-questions collection in the vector-sample bucket’s color scope, which assumes you have imported rgb_questions.json into a collection with that name. It truncates the couchbase_search_query.knn.vector attribute to the first four values to improve readability.

SELECT RAW OBJECT_PUT(d, "couchbase_search_query",
  OBJECT_PUT(d.couchbase_search_query, "knn",
    ARRAY OBJECT_PUT(k, "vector",
      ARRAY_CONCAT(k.vector[0:4], ["..."])
    )
    FOR k IN d.couchbase_search_query.knn END
  )
)
FROM `vector-sample`.`color`.`rgb-questions` AS d
USE KEYS ["#FFEFD5"];

The result of the query shows the content of one of the documents:

[{
    "couchbase_search_query": {
        "fields": [
            "*"
        ],
        "knn": [{
            "field": "embedding_vector_dot",
            "k": 3,
            "vector": [
                0.005115953739732504,
                0.004331615287810564,
                0.014279481954872608,
                0.000619320897385478,
                "..."
            ]
        }],
        "query": {
            "match_none": {}
        },
        "sort": [
            "-_score"
        ]
    },
    "embedding_model": "text-embedding-ada-002-v2",
    "id": "#FFEFD5",
    "question": "What is the name of the color that is reminiscent of a tropical fruit and has a calming effect, often used in fashion and interior design?",
    "wanted_similar_color_from_search": "papaya whip"
}]

The couchbase_search_query.knn.vector attribute contains the embedded vector for the question attribute.

This example queries the embedding_vector_dot column. It appears here with most of the 1536 vector values omitted:

SELECT b.color, b.description from `rgb` AS b 
order by APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, 
[
            0.005115953739732504,
            0.004331615287810564,
            0.014279481954872608,
            /* long list of vector values omitted */
            -0.005022349301725626,
            0.002007648814469576,
            -0.03757078945636749
          ], "DOT")  LIMIT 3;

Another option is to import the rgb_questions.json file into a collection named rgb-questions in the vector-sample bucket’s color scope. You can then use a subquery to get the vector for the question and pass it to your query of the rgb collection’s embedding_vector_dot attribute:

WITH question_vec AS (
        SELECT RAW couchbase_search_query.knn[0].vector  
        from `vector-sample`.`color`.`rgb-questions` 
        WHERE meta().id = "#FFEFD5")
SELECT b.color, b.description from `rgb` AS b 
order by APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, 
question_vec[0], "DOT")  LIMIT 3;

In either case, the results of the query are the same:

[
  {
    "color": "cantalope",
    "description": "The color cantaloupe is a soft and soothing shade that evokes feelings of calmness
    and relaxation. It is a refreshing hue that brings to mind the juicy and sweet fruit it is named
    after. This delicate color is a pale orange with hints of pink, giving it a subtle and gentle
    appearance. It is a perfect color for creating a peaceful and tranquil atmosphere."
  },
  {
    "color": "papaya whip",
    "description": "Papaya whip is a soft and mellow color that can be described as a light shade of
    peach or coral. It has a calming and soothing effect, similar to the tropical fruit it is named
    after. This color is perfect for creating a warm and inviting atmosphere, and it pairs well with
    other pastel shades or neutral tones. Papaya whip is a versatile color that can be used in both
    fashion and interior design, adding a touch of elegance and sophistication to any space."
  },
  {
    "color": "apricot",
    "description": "Apricot is a warm and inviting color, reminiscent of the soft glow of a sunset. It
    has the ability to soften the harshness of other colors and enliven any space it is used in. It is a
    delicate and soothing hue, perfect for creating a cozy and welcoming atmosphere."
  }
]

The second result, the color papaya whip, matches the wanted_similar_color_from_search attribute of the question document in the rgb-questions collection.

Adding a Scalar

Using additional scalar fields in your search can improve your results and reduce the overhead of performing a vector search. For example, filtering on an additional scalar field reduces the number of vectors an index scan has to compare. Searching for scalar values requires fewer resources than vector searches.

The example that created the color_desc_idx index added fields in addition to the embedding_vector_dot vector field. The following example adds a filter based on the brightness field to reduce the number of vectors that get compared and also improve the results.

The version of the query that uses a subquery of the rgb-questions collection to get the vector value is:

WITH question_vec AS (
        SELECT RAW couchbase_search_query.knn[0].vector  
        from `vector-sample`.`color`.`rgb-questions` 
        WHERE meta().id = "#FFEFD5")
SELECT b.color, b.description, b.brightness from `rgb` AS b 
WHERE b.brightness > 190.0
order by APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, 
question_vec[0], "DOT")  LIMIT 3;

The truncated version of the query is:

SELECT b.color, b.description, b.brightness from `rgb` AS b 
WHERE b.brightness > 190.0
ORDER BY APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, 
[
            0.005115953739732504,
            0.004331615287810564,
            0.014279481954872608,
            /* long list of vector values omitted */
            -0.005022349301725626,
            0.002007648814469576,
            -0.03757078945636749
          ], "DOT")  LIMIT 3;

The results show that this query moves the papaya whip entry to the top.

[{
        "color": "papaya whip",
        "description": "Papaya whip is a soft and mellow color that can be described as a light shade of peach or coral. It has a calming and soothing effect, similar to the tropical fruit it is named after. This color is perfect for creating a warm and inviting atmosphere, and it pairs well with other pastel shades or neutral tones. Papaya whip is a versatile color that can be used in both fashion and interior design, adding a touch of elegance and sophistication to any space.",
        "brightness": 240.82
    },
    {
        "color": "pale turquoise",
        "description": "Pale turquoise is a delicate and soothing color that can be described as a soft blend of blue and green. It has a calming effect and can evoke feelings of tranquility and serenity. The color is often associated with the ocean and can bring to mind images of clear, tropical waters. It has a gentle and subtle quality, making it a popular choice for creating a peaceful and serene atmosphere.",
        "brightness": 219.163
    },
    {
        "color": "light green",
        "description": "Light green is a calming and refreshing color that evokes feelings of tranquility and new beginnings. It is a delicate shade that is often associated with nature and growth. The softness of this color can bring a sense of balance and harmony to any space, making it a popular choice for interior design. Light green is also known to have a rejuvenating effect, making it a perfect color for relaxation and self-care. Its gentle hue can bring a sense of peace and serenity to the mind, body, and soul.",
        "brightness": 199.178
    }
]