Vector Functions
- reference
- Couchbase Server 8.0
Vector functions enable you to work with vector values.
Vector functions include similarity functions to find the distance between two vectors, functions that check for a vector value, and functions that modify vector values.
For more information about vectors and vector indexes, see Use Vector Indexes for AI Applications.
APPROX_VECTOR_DISTANCE(vec, queryvec, metric [, nprobes [, rerank [, topNScan]]])
This function has an alias ANN_DISTANCE().
Description
Finds the approximate distance between a provided vector and the content of a specified field that contains vector embeddings.
This function works best with a Hyperscale Vector index or Composite Vector index. The Query optimizer selects such an index for a query that contains this function when all of the following are true:
- The cluster has a Hyperscale Vector index or a Composite Vector index whose vector index key is the vector field referenced by the function.
- The vector index key uses a similarity setting that matches the distance metric referenced by the function.
- The vector index key has the same dimension as the vector provided to the function.
This function is faster, but less precise than VECTOR_DISTANCE(). You should use this function in your production queries.
Arguments
- vec
-
The name of a field that contains vector embeddings. The field must contain an array of floating point numbers, or a base64 encoded string.
- queryvec
-
An array of floating point numbers, or a base64 encoded string, representing the vector value to search for in the vector field.
- metric
-
A string representing the distance metric to use when comparing the vectors. To select a Hyperscale Vector index or Composite Vector index for the query, the distance metric should match the similarity setting that you used when you created the index. The supported values are COSINE, DOT, L2 (or EUCLIDEAN), and L2_SQUARED (or EUCLIDEAN_SQUARED).
- nprobes
-
[Optional] An integer representing the number of centroids to probe for matching vectors. If the Query Service selects a Hyperscale Vector index or Composite Vector index for the query, this option defaults to the scan_nprobes setting that you used when you created the index. If an invalid value is provided, the option defaults to 1.
- rerank
-
[Optional; can only be used when nprobes is specified] A Boolean. If false, the function uses quantized vectors. If true, the function uses full vectors to reorder the results. The default is false.
- topNScan
-
[Optional; can only be used when nprobes and rerank are specified] This option only applies if using a Hyperscale Vector index. A positive integer representing the number of records to scan. The default is 0, meaning the function uses the indexer default.
Examples
To try the examples in this section, you must do the following:
- Install the rgb and rgb-questions collections from the supplied vector sample, as described in Prerequisites.
- Create a Composite Vector index in the rgb collection on the field named colorvect_l2, as described in CREATE INDEX Example 6.
- Create a Hyperscale Vector index in the rgb collection on the field named embedding_vector_dot, as described in CREATE VECTOR INDEX Example 1.
This example finds the approximate vector distance between a query vector and three different embedded vectors.
WITH data AS ([
{"vector": [1, 2, 3, 4], "similarity": "identical"},
{"vector": [1, 2, 3, 5], "similarity": "close"},
{"vector": [6, 7, 8, 9], "similarity": "distant"}
])
SELECT
similarity,
APPROX_VECTOR_DISTANCE(vector, [1, 2, 3, 4], "COSINE") AS cosine,
APPROX_VECTOR_DISTANCE(vector, [1, 2, 3, 4], "DOT") AS dot,
APPROX_VECTOR_DISTANCE(vector, [1, 2, 3, 4], "L2") AS l2,
APPROX_VECTOR_DISTANCE(vector, [1, 2, 3, 4], "L2_SQUARED") AS l2_squared
FROM data;
The results show how the distance changes as the similarity decreases.
[
{
"similarity": "identical",
"cosine": 0,
"dot": -30,
"l2": 0,
"l2_squared": 0
},
{
"similarity": "close",
"cosine": 0.00600091145203363,
"dot": -34,
"l2": 1,
"l2_squared": 1
},
{
"similarity": "distant",
"cosine": 0.0369131753138463,
"dot": -80,
"l2": 10,
"l2_squared": 100
}
]
Compare this with the result of VECTOR_DISTANCE() Example 1. In this case, the results are identical because the query is not using a Hyperscale Vector index or Composite Vector index.
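The figures above are consistent with the following definitions, sketched here in Python outside the database. This is an illustration, not Couchbase's implementation: the function name `distances` is hypothetical, and the definitions (cosine distance as 1 minus the cosine similarity, DOT as the negated dot product) are inferred from the results shown rather than taken from a specification.

```python
import math


def distances(vec, queryvec):
    """Return four distances for two equal-length vectors.

    The definitions are assumptions inferred from the example output.
    """
    dot = sum(a * b for a, b in zip(vec, queryvec))
    norm_v = math.sqrt(sum(a * a for a in vec))
    norm_q = math.sqrt(sum(b * b for b in queryvec))
    l2_squared = sum((a - b) ** 2 for a, b in zip(vec, queryvec))
    return {
        "cosine": 1 - dot / (norm_v * norm_q),  # 1 minus cosine similarity
        "dot": -dot,                            # negated dot product
        "l2": math.sqrt(l2_squared),            # Euclidean distance
        "l2_squared": l2_squared,               # squared Euclidean distance
    }


query = [1, 2, 3, 4]
samples = [
    ("identical", [1, 2, 3, 4]),
    ("close", [1, 2, 3, 5]),
    ("distant", [6, 7, 8, 9]),
]
for label, vec in samples:
    print(label, distances(vec, query))
```

Under these definitions a smaller value always means a closer match, which is why the DOT column is negative and falls as the dot product grows.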
This example finds the colors from the rgb collection that are similar to gray, which has an RGB value of [128, 128, 128].
SELECT b.color, b.colorvect_l2, b.brightness
FROM `rgb` AS b
ORDER BY APPROX_VECTOR_DISTANCE(b.colorvect_l2, [128, 128, 128], "L2")
LIMIT 5;
The top result is the entry for gray. The other results are all shades of gray:
[
{
"color": "grey",
"colorvect_l2": [
128,
128,
128
],
"brightness": 128
},
{
"color": "slate gray",
"colorvect_l2": [
112,
128,
144
],
"brightness": 125.04
},
{
"color": "light slate gray",
"colorvect_l2": [
119,
136,
153
],
"brightness": 132.855
},
{
"color": "light gray",
"colorvect_l2": [
144,
144,
144
],
"brightness": 144
},
{
"color": "dim gray",
"colorvect_l2": [
105,
105,
105
],
"brightness": 105
}
]
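Because the function is approximate, the order of near-ties can differ from an exact ranking. As a rough check, the sketch below recomputes exact L2 distances in Python for the five vectors returned above. The `results` list simply copies the values from the output, the helper `l2` is hypothetical, and nothing here reflects how the index ranks candidates internally.

```python
import math

# (color, colorvect_l2) pairs copied from the query output above.
results = [
    ("grey", [128, 128, 128]),
    ("slate gray", [112, 128, 144]),
    ("light slate gray", [119, 136, 153]),
    ("light gray", [144, 144, 144]),
    ("dim gray", [105, 105, 105]),
]
query = [128, 128, 128]


def l2(a, b):
    """Exact Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Re-rank the returned rows by exact distance to the query vector.
exact = sorted(results, key=lambda item: l2(item[1], query))
for color, vec in exact:
    print(f"{color}: {l2(vec, query):.2f}")
```

By exact distance, light gray (about 27.71) sits marginally ahead of light slate gray (about 27.75), so the two are effectively tied. Small reorderings like this are the kind of difference to expect between approximate and exact results.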
This example compares embedded vector values.
The query finds the colors from the rgb collection whose descriptions are most similar to the following presupplied question:
What is the color that is often linked to feelings of peace and tranquility, and is reminiscent of the clear sky on a calm day?
WITH question_vec AS (
  SELECT RAW couchbase_search_query.knn[0].vector -- (1)
  FROM `vector-sample`.`color`.`rgb-questions`
  WHERE meta().id = "#87CEEB"
),
colors AS (
  SELECT b.color
  FROM `vector-sample`.`color`.`rgb` AS b
  ORDER BY APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, question_vec[0], "l2") -- (2)
  LIMIT 10 )
SELECT RAW colors;
(1) The vector field in the rgb-questions collection contains the embedded vectors associated with the presupplied questions.
(2) The embedding_vector_dot field in the rgb collection contains the embedded vectors associated with the color descriptions.
The query returns 10 colors where the embedded vector associated with the color description is most similar to the embedded vector associated with the presupplied question.
[
[{
"color": "deep sky blue"
},
{
"color": "sky blue"
},
{
"color": "light sky blue"
},
{
"color": "pale turquoise"
},
{
"color": "blue"
},
{
"color": "slate blue"
},
{
"color": "light cyan"
},
{
"color": "cadet blue"
},
{
"color": "light blue"
},
{
"color": "medium blue"
}
]
]
Compare this with the result of VECTOR_DISTANCE() Example 2. In this case, the approximate vector distance does not give accurate results.
This example improves on APPROX_VECTOR_DISTANCE() Example 3 by increasing the number of centroids to probe.
WITH question_vec AS (
SELECT RAW couchbase_search_query.knn[0].vector
FROM `vector-sample`.`color`.`rgb-questions`
WHERE meta().id = "#87CEEB"
),
colors AS (
SELECT b.color
FROM `vector-sample`.`color`.`rgb` AS b
ORDER BY APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, question_vec[0], "l2", 4)
LIMIT 10 )
SELECT RAW colors;
[
[{
"color": "deep sky blue"
},
{
"color": "sky blue"
},
{
"color": "light sky blue"
},
{
"color": "pale turquoise"
},
{
"color": "light cyan"
},
{
"color": "slate blue"
},
{
"color": "blue"
},
{
"color": "cadet blue"
},
{
"color": "light blue"
},
{
"color": "medium blue"
}
]
]
Compare this with the result of VECTOR_DISTANCE() Example 2. The approximate vector distance now gives much more accurate results.
This example is similar to APPROX_VECTOR_DISTANCE() Example 4, but also uses reranking to improve its accuracy.
The query finds colors from the rgb collection whose descriptions are most similar to the following presupplied question:
What is a soft and gentle hue that can add warmth and brightness to a room?
WITH question_vec AS (
  SELECT RAW couchbase_search_query.knn[0].vector -- (1)
  FROM `vector-sample`.`color`.`rgb-questions`
  WHERE meta().id = "#FFFFE0")
SELECT b.color, b.description, b.id
FROM `vector-sample`.`color`.`rgb` AS b
ORDER BY APPROX_VECTOR_DISTANCE(b.embedding_vector_dot, question_vec[0], "l2", 4, TRUE) -- (2)
LIMIT 3;
(1) The vector field in the rgb-questions collection contains the embedded vectors associated with the presupplied questions.
(2) The embedding_vector_dot field in the rgb collection contains the embedded vectors associated with the color descriptions.
The query returns 3 colors where the embedded vector associated with the color description is most similar to the embedded vector associated with the presupplied question.
[{
"color": "peach",
"description": "Peach is a soft and warm color that can enliven any space. It has
a delicate and gentle quality, like the softness of a peach's skin.
This color can soften the harshness of other colors and bring a sense
of warmth and comfort. It is a versatile color that can be both calming
and invigorating, making it a popular choice in interior design. Peach
is a color that evokes feelings of happiness and positivity, making it
a perfect addition to any room.",
"id": "#FF8C3C"
},
{
"color": "light yellow",
"description": "Light yellow is a delicate and gentle color that can soften the overall
tone of a room. It has a bright and cheerful quality that can brighten up
any space. This color also has the ability to illuminate a room, making it
feel more open and airy. Light yellow is a perfect choice for creating a
warm and inviting atmosphere.",
"id": "#FFFFE0"
},
{
"color": "apricot",
"description": "Apricot is a warm and inviting color, reminiscent of the soft glow of a
sunset. It has the ability to soften the harshness of other colors and
enliven any space it is used in. It is a delicate and soothing hue, perfect
for creating a cozy and welcoming atmosphere.",
"id": "#FB8737"
}
]
For more information and examples, see Hyperscale Vector Index Reranking and Full Vector Persistence.
DECODE_VECTOR(vector [, byte_order])
This function has an alias VECTOR_DECODE().
Description
Reverses the encoding done by the ENCODE_VECTOR() function.
Arguments
- vector
-
String, or any SQL++ expression that evaluates to a string, representing the base64 encoding of a vector value.
- byte_order
-
[Optional] A Boolean that determines the byte order of the vector value. If true, it’s big-endian. If false, it’s little-endian. The default is false.
Example
The following query decodes the base64 encoding of a vector value using two different byte orders.
SELECT DECODE_VECTOR("AACAPwAAAEAAAEBAAACAQA==") AS little_endian,
DECODE_VECTOR("P4AAAEAAAABAQAAAQIAAAA==", true) AS big_endian;
[
{
"little_endian": [
1,
2,
3,
4
],
"big_endian": [
1,
2,
3,
4
]
}
]
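The two example strings are consistent with a packed array of 32-bit IEEE 754 floats that is then base64-encoded. The Python sketch below decodes them on that assumption. The layout is inferred from the example values rather than from a format specification, and `decode_vector` is a hypothetical helper, not part of any Couchbase SDK.

```python
import base64
import struct


def decode_vector(encoded, big_endian=False):
    """Decode a base64 string into a list of floats.

    Assumes the payload is a packed array of 32-bit floats.
    """
    raw = base64.b64decode(encoded)
    count = len(raw) // 4  # four bytes per float32 component
    fmt = (">" if big_endian else "<") + f"{count}f"
    return list(struct.unpack(fmt, raw))


print(decode_vector("AACAPwAAAEAAAEBAAACAQA=="))        # little-endian input
print(decode_vector("P4AAAEAAAABAQAAAQIAAAA==", True))  # big-endian input
```

Both calls yield the same four components, matching the query result. Decoding with the wrong byte order would still succeed but would produce meaningless numbers, so the byte order used for encoding has to be known.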
ENCODE_VECTOR(vector [, byte_order])
This function has an alias VECTOR_ENCODE().
Description
Returns the base64 encoding of a vector value.
Arguments
- vector
-
An array of floating point numbers, or any SQL++ expression that evaluates to an array of floating point numbers.
- byte_order
-
[Optional] A Boolean that determines the byte order of the vector value. If true, it’s big-endian. If false, it’s little-endian. The default is false.
Example
The following query encodes an array of floating point numbers using two different byte orders.
SELECT ENCODE_VECTOR([1, 2, 3, 4]) AS little_endian,
ENCODE_VECTOR([1, 2, 3, 4], true) AS big_endian;
[
{
"little_endian": "AACAPwAAAEAAAEBAAACAQA==",
"big_endian": "P4AAAEAAAABAQAAAQIAAAA=="
}
]
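The same strings can be reproduced outside the database, which may help when preparing base64 vectors in application code. The Python sketch below assumes the encoding is a packed array of 32-bit floats followed by standard base64. That assumption is inferred from the example output, and `encode_vector` is a hypothetical helper.

```python
import base64
import struct


def encode_vector(vector, big_endian=False):
    """Pack numbers as 32-bit floats and return the base64 text (assumed layout)."""
    fmt = (">" if big_endian else "<") + f"{len(vector)}f"
    packed = struct.pack(fmt, *vector)
    return base64.b64encode(packed).decode("ascii")


print(encode_vector([1, 2, 3, 4]))        # little-endian
print(encode_vector([1, 2, 3, 4], True))  # big-endian
```

Note that packing as 32-bit floats rounds any component that cannot be represented exactly in that width.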
ISVECTOR(vector, dimension, format)
Description
Checks if the supplied expression is an array of floating point numbers with the specified number of dimensions. This can be used to determine whether a field contains a vector value.
Arguments
- vector
-
An array of floating point numbers, or any SQL++ expression that evaluates to an array of floating point numbers.
- dimension
-
An integer representing the number of dimensions.
- format
-
A string. This argument must always be present and must have the value "float32".
Return Value
Returns true if the expression is an array of floating point numbers with the specified number of dimensions; otherwise, returns false.
Examples
To try the examples in this section, you must install the rgb and rgb-questions collections from the supplied vector sample, as described in Prerequisites.
SELECT ISVECTOR([1, 2, 3, 4], 4, "float32") as vector,
ISVECTOR([1, 2, 3, 4], 3, "float32") as wrong_dimension,
ISVECTOR(["a", "b", "c", "d"], 4, "float32") as wrong_values;
[
{
"vector": true,
"wrong_dimension": false,
"wrong_values": false
}
]
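The check behaves like a length test combined with a type test. A minimal Python sketch of that logic follows, assuming that integers count as valid components (as the first result above suggests). The `is_vector` helper is hypothetical and leaves out the format argument.

```python
from numbers import Real


def is_vector(value, dimension):
    """Return True if value is a list of exactly `dimension` numbers."""
    return (
        isinstance(value, list)
        and len(value) == dimension
        # Booleans are excluded even though Python treats them as integers.
        and all(isinstance(x, Real) and not isinstance(x, bool) for x in value)
    )


print(is_vector([1, 2, 3, 4], 4))          # right length, numeric values
print(is_vector([1, 2, 3, 4], 3))          # wrong dimension
print(is_vector(["a", "b", "c", "d"], 4))  # non-numeric values
```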
Check whether the specified fields in the rgb collection contain vector values.
SELECT ISVECTOR(description, 1, "float32") AS description,
ISVECTOR(colorvect_l2, 3, "float32") AS colorvect_l2,
ISVECTOR(embedding_vector_dot, 1536, "float32") AS embedding_vector_dot
FROM `vector-sample`.color.rgb LIMIT 1;
[{
"description": false,
"colorvect_l2": true,
"embedding_vector_dot": true
}]
The results show that the description field is not a vector field. The colorvect_l2 and embedding_vector_dot fields are vector fields, with the specified number of dimensions.
NORMALIZE_VECTOR(vector)
This function has aliases NORMALISE_VECTOR(), VECTOR_NORMALISE(), and VECTOR_NORMALIZE().
Description
Normalizes a vector. This function changes the magnitude of a vector, but not its direction, so that the vector has unit length. This is useful in cases where you only need the direction of the vector, not its magnitude.
To normalize a vector $x$, the function first calculates the length of the vector, $|x|$. This is the square root of the sum of the squares of each component of the vector.
To find the normalized vector, $\hat{x}$, the function then divides each component of the vector by the length of the vector.
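The two steps described above can be sketched in Python as follows. The `normalize` helper is hypothetical. Raising an error for a zero-length vector is an assumption of this sketch, since the text does not say how the function treats that case.

```python
import math


def normalize(vector):
    """Scale a vector to unit length by dividing each component by its length."""
    length = math.sqrt(sum(c * c for c in vector))
    if length == 0:
        # Assumption: a zero vector has no direction, so refuse to normalize it.
        raise ValueError("cannot normalize a zero-length vector")
    return [c / length for c in vector]


unit = normalize([3, 4])
print(unit)  # [0.6, 0.8]
```

Here the length of [3, 4] is 5, so each component is divided by 5. The result points in the same direction and has a length of 1.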
VECTOR_DISTANCE(vec, queryvec, metric)
This function has an alias KNN_DISTANCE().
Description
Finds the exact distance between a provided vector and the content of a specified field that contains vector embeddings.
This function does not use a Hyperscale Vector index or Composite Vector index to perform the comparison. Instead, it performs a brute-force search for similar vectors.
This function is slower, but more precise than APPROX_VECTOR_DISTANCE(). You should use this function to check the accuracy of your production queries, and adjust the index and query settings to improve the recall accuracy.
Arguments
- vec
-
The name of a field that contains vector embeddings. The field must contain an array of floating point numbers, or a base64 encoded string.
- queryvec
-
An array of floating point numbers, or a base64 encoded string, representing the vector value to search for in the vector field.
- metric
-
A string representing the distance metric to use when comparing the vectors.
The supported values are COSINE, DOT, L2 (or EUCLIDEAN), and L2_SQUARED (or EUCLIDEAN_SQUARED).
Examples
To try the examples in this section, you must install the rgb and rgb-questions collections from the supplied vector sample, as described in Prerequisites.
The following query finds the exact vector distance between a query vector and three different embedded vectors.
WITH data AS ([
{"vector": [1, 2, 3, 4], "similarity": "identical"},
{"vector": [1, 2, 3, 5], "similarity": "close"},
{"vector": [6, 7, 8, 9], "similarity": "distant"}
])
SELECT
similarity,
VECTOR_DISTANCE(vector, [1, 2, 3, 4], "COSINE") AS cosine,
VECTOR_DISTANCE(vector, [1, 2, 3, 4], "DOT") AS dot,
VECTOR_DISTANCE(vector, [1, 2, 3, 4], "L2") AS l2,
VECTOR_DISTANCE(vector, [1, 2, 3, 4], "L2_SQUARED") AS l2_squared
FROM data;
The results show how the distance changes as the similarity decreases.
[
{
"similarity": "identical",
"cosine": 0,
"dot": -30,
"l2": 0,
"l2_squared": 0
},
{
"similarity": "close",
"cosine": 0.00600091145203363,
"dot": -34,
"l2": 1,
"l2_squared": 1
},
{
"similarity": "distant",
"cosine": 0.0369131753138463,
"dot": -80,
"l2": 10,
"l2_squared": 100
}
]
Compare this with the result of APPROX_VECTOR_DISTANCE() Example 1.
This example compares embedded vector values.
The query finds colors from the rgb collection whose descriptions are most similar to the following presupplied question:
What is the color that is often linked to feelings of peace and tranquility, and is reminiscent of the clear sky on a calm day?
WITH question_vec AS (
  SELECT RAW couchbase_search_query.knn[0].vector -- (1)
  FROM `vector-sample`.`color`.`rgb-questions`
  WHERE meta().id = "#87CEEB"
),
colors AS (
  SELECT b.color
  FROM `vector-sample`.`color`.`rgb` AS b
  ORDER BY VECTOR_DISTANCE(b.embedding_vector_dot, question_vec[0], "l2") -- (2)
  LIMIT 10 )
SELECT RAW colors;
(1) The vector field in the rgb-questions collection contains the embedded vectors associated with the presupplied questions.
(2) The embedding_vector_dot field in the rgb collection contains the embedded vectors associated with the color descriptions.
The query returns 10 colors where the embedded vector associated with the color description is most similar to the embedded vector associated with the presupplied question.
[
[{
"color": "deep sky blue"
},
{
"color": "sky blue"
},
{
"color": "light sky blue"
},
{
"color": "pale turquoise"
},
{
"color": "blue"
},
{
"color": "slate blue"
},
{
"color": "light cyan"
},
{
"color": "cadet blue"
},
{
"color": "light blue"
},
{
"color": "medium blue"
}
]
]
For more information and examples, see Determine Recall Rate.
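One common way to quantify how closely an approximate query tracks the exact one is recall at k: the fraction of the exact top-k results that also appear in the approximate top-k. The Python sketch below applies that general definition to the result lists of VECTOR_DISTANCE() Example 2 and APPROX_VECTOR_DISTANCE() Example 4. The `recall_at_k` helper is hypothetical, and this is not necessarily the procedure described in Determine Recall Rate.

```python
def recall_at_k(exact, approx, k):
    """Fraction of the exact top-k results that also appear in the approximate top-k."""
    exact_top = set(exact[:k])
    approx_top = set(approx[:k])
    return len(exact_top & approx_top) / k


# Result order from VECTOR_DISTANCE() Example 2.
exact = [
    "deep sky blue", "sky blue", "light sky blue", "pale turquoise", "blue",
    "slate blue", "light cyan", "cadet blue", "light blue", "medium blue",
]
# Result order from APPROX_VECTOR_DISTANCE() Example 4.
approx = [
    "deep sky blue", "sky blue", "light sky blue", "pale turquoise", "light cyan",
    "slate blue", "blue", "cadet blue", "light blue", "medium blue",
]

print(recall_at_k(exact, approx, 10))  # 1.0
print(recall_at_k(exact, approx, 5))   # 0.8
```

The two lists contain the same 10 colors, so recall at 10 is 1.0. Two entries swap places, which lowers recall at 5 to 0.8.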
Aliases
Some vector functions have aliases.
- ANN_DISTANCE() is an alias for APPROX_VECTOR_DISTANCE().
- KNN_DISTANCE() is an alias for VECTOR_DISTANCE().
- NORMALISE_VECTOR() is an alias for NORMALIZE_VECTOR().
- VECTOR_DECODE() is an alias for DECODE_VECTOR().
- VECTOR_ENCODE() is an alias for ENCODE_VECTOR().
- VECTOR_NORMALISE() is an alias for NORMALIZE_VECTOR().
- VECTOR_NORMALIZE() is an alias for NORMALIZE_VECTOR().