Working with Vector Search

      Use Vector Search with Full Text Search and Query.

      To configure a project to use vector search, follow the installation instructions to add the Vector Search extension.

      You must install Couchbase Lite to use the Vector Search extension.

      Create a Vector Index

      This method shows how you can create a vector index using the Couchbase Lite Vector Search extension.

              // Create a vector index configuration with a document property named "vector",
              // 3 dimensions, and 100 centroids. Customize the encoding, the distance metric,
              // the number of probes, and the training size.
              // Note: Free the created encoding using CBLVectorEncoding_Free after creating the index.
              CBLVectorIndexConfiguration config{};
              config.expressionLanguage = kCBLN1QLLanguage;
              config.expression = FLStr("vector");
              config.dimensions = 3;
              config.centroids = 100;
              config.metric = kCBLDistanceMetricCosine;
              config.numProbes = 8;
              config.encoding = CBLVectorEncoding_CreateNone();
              config.minTrainingSize = 2500;
              config.maxTrainingSize = 5000;

      First, initialize a CBLVectorIndexConfiguration struct and set the following required fields:

      • The expression that returns the vector, here the document property "vector".

      • The width, or number of dimensions, of each vector, set to 3.

      • The number of centroids, set to 100. The index then has one hundred buckets, each with a single centroid that gathers similar vectors together.

      You can also alter some optional config settings such as encoding. From there, you can create an index within a given collection, using the previously generated config object.

      The number of vectors, the width or dimensions of the vectors and the training size can incur high CPU and memory costs as the size of each variable increases. This is because the training vectors have to be resident on the machine.
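      To make the bucket and probe ideas concrete, here is a minimal conceptual sketch. It is not how Couchbase Lite implements its index; the names squaredDistance and nearestCentroids are hypothetical and exist only for this illustration. It ranks centroids by distance to a vector and returns the closest ones: with one probe, that is the bucket a vector would fall into, and with more probes it is the set of buckets a search would visit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two vectors of equal length.
float squaredDistance(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Hypothetical helper: return the indexes of the `numProbes` centroids
// closest to `query`, nearest first.
std::vector<std::size_t> nearestCentroids(const std::vector<Vec>& centroids,
                                          const Vec& query,
                                          std::size_t numProbes) {
    std::vector<std::size_t> order(centroids.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    const std::size_t n = std::min(numProbes, order.size());
    // Only the first n positions need to be sorted.
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(n),
                      order.end(),
                      [&](std::size_t l, std::size_t r) {
                          return squaredDistance(centroids[l], query) <
                                 squaredDistance(centroids[r], query);
                      });
    order.resize(n);
    return order;
}
```

      Searching more buckets raises the chance of finding the true nearest vectors at the cost of more distance computations, which is the trade-off the number of probes controls.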

      Vector Index Configuration

      The table below lists the settings you can modify in your CBLVectorIndexConfiguration. For more information on specific configurations, see Vector Search.

      Table 1. Vector Index Configuration Options
      Each entry gives the Configuration Name, followed by Is Required, Default Configuration, and Further Information.

      Expression
        Is Required: yes
        Default Configuration: No default
        Further Information: A SQL++ expression indicating where to get the vectors. Use a document property for embedded vectors, or prediction() to call a registered predictive model.

      Number of Dimensions
        Is Required: yes
        Default Configuration: No default
        Further Information: 2-4096

      Number of Centroids
        Is Required: yes
        Default Configuration: No default
        Further Information: 1-64000. The general guideline is approximately the square root of the number of documents.

      Distance Metric
        Is Required: no
        Default Configuration: Squared Euclidean Distance (euclideanSquared)
        Further Information: You can set the following alternatives as your Distance Metric:
          • cosine (1 - cosine similarity)
          • Euclidean
          • dot (negated dot product)

      Encoding
        Is Required: no
        Default Configuration: Scalar Quantizer (SQ) or SQ-8 bits
        Further Information: There are three possible configurations:
          • None: no compression, no data loss
          • Scalar Quantizer (SQ) or SQ-8 bits (default): reduces the number of bits per dimension
          • Product Quantizer (PQ): reduces the number of dimensions and bits per dimension

      Training Size
        Is Required: no
        Default Configuration: Zero for both the minimum and maximum training size. The training size is then calculated from the number of centroids and the encoding type.
        Further Information: The guidelines for the minimum and maximum training size are as follows:
          • The minimum training size is set to 25x the number of centroids or 2^(PQ's bits) when PQ is used
          • The maximum training size is set to 256x the number of centroids or 2^(PQ's bits) when PQ is used

      NumProbes
        Is Required: no
        Default Configuration: 0. The number of probes is then calculated from the number of centroids.
        Further Information: A guideline for setting a custom number of probes is at least 8 or 0.5% of the number of centroids.

      isLazy
        Is Required: no
        Default Configuration: False
        Further Information: Setting the value to true enables lazy mode for the vector index.

      Altering the default training sizes could be detrimental to the accuracy of returned results produced by the model and total computation time.
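      If you do decide to set these values yourself, the guidelines in Table 1 reduce to simple arithmetic. The sketch below is only an illustration of that arithmetic, not part of the Couchbase Lite API: SuggestedSettings and suggestSettings are hypothetical names, the Product Quantizer case is ignored, and "at least 8 or 0.5%" is read as "the larger of the two", which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical holder for values derived from the Table 1 guidelines.
struct SuggestedSettings {
    std::uint32_t centroids;
    std::uint32_t minTrainingSize;
    std::uint32_t maxTrainingSize;
    std::uint32_t numProbes;
};

SuggestedSettings suggestSettings(std::uint64_t documentCount) {
    assert(documentCount > 0);
    // Centroids: about the square root of the document count, kept within 1-64000.
    const long long root = std::llround(std::sqrt(static_cast<double>(documentCount)));
    const auto centroids = static_cast<std::uint32_t>(
        std::min<long long>(std::max<long long>(root, 1), 64000));
    // 0.5% of the centroids, rounded up, using integer arithmetic.
    const std::uint32_t halfPercent = (centroids * 5u + 999u) / 1000u;

    SuggestedSettings s{};
    s.centroids = centroids;
    s.minTrainingSize = 25u * centroids;    // 25x the number of centroids
    s.maxTrainingSize = 256u * centroids;   // 256x the number of centroids
    s.numProbes = std::max<std::uint32_t>(8u, halfPercent);  // assumed: larger of 8 and 0.5%
    return s;
}
```

      For a collection of 10,000 documents this suggests 100 centroids and a minimum training size of 2,500, which matches the values used in the first configuration example on this page.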

      Generating Vectors

      You can use the following methods to generate vectors in Couchbase Lite:

      1. You can call a Machine Learning (ML) model and embed the generated vectors inside the documents.

      2. You can use the prediction() function to generate the vectors to be indexed for each document at indexing time.

      3. You can use a Lazy Vector Index (lazy index) to generate vectors asynchronously from remote ML models that may not always be reachable or functioning, skipping or scheduling retries for those specific cases.

      Below are example configurations of the previously mentioned methods.

      Create a Vector Index with Embeddings

      This method shows you how to create a Vector Index with embeddings.

              CBLError err {};
              // Get the collection object named "colors" in the default scope.
              CBLCollection* collection =
                  CBLDatabase_Collection(database, FLStr("colors"), FLStr("_default"), &err);
              if (collection == nullptr) { throw std::domain_error("No Collection Found"); }
      
              // Create a vector index configuration with a document property named "vector",
              // 3 dimensions, and 100 centroids.
              CBLVectorIndexConfiguration config {};
              config.expressionLanguage = kCBLN1QLLanguage;
              config.expression = FLStr("vector");
              config.dimensions = 3;
              config.centroids = 100;
      
              // Create a vector index from the configuration with the name "colors_index".
              bool result = CBLCollection_CreateVectorIndex(collection, FLStr("colors_index"), config, &err);
              if (!result) { throw std::domain_error("Create Index Error"); }
      
              CBLCollection_Release(collection);
      1. First, create the standard configuration, setting the expression, the number of dimensions, and the number of centroids for the vector embedding.

      2. Next, create a vector index named colors_index on the collection and pass it the configuration.

      Create Vector Index Embeddings from a Predictive Model

      This method uses the prediction() function to generate the vectors to be indexed for each document at index time. The key difference is that the config object uses the output of the prediction() function as its expression.

              // Register the predictive model named "ColorModel".
              CBL_RegisterPredictiveModel(FLStr("ColorModel"), model);
      
              CBLError err {};
              // Get the collection object named "colors" in the default scope.
              CBLCollection* collection =
                  CBLDatabase_Collection(database, FLStr("colors"), FLStr("_default"), &err);
              if (collection == nullptr) { throw std::domain_error("No Collection Found"); }
      
              // Create a vector index configuration with an expression using the prediction
              // function to get the vectors from the registered predictive model.
              CBLVectorIndexConfiguration config {};
              config.expressionLanguage = kCBLN1QLLanguage;
              config.expression = FLStr("prediction(ColorModel, {\"colorInput\": color}).vector");
              config.dimensions = 3;
              config.centroids = 100;
      
              // Create a vector index from the configuration with the name "colors_index".
              bool result = CBLCollection_CreateVectorIndex(collection, FLStr("colors_index"), config, &err);
              if (!result) { throw std::domain_error("Create Index Error"); }
      
              CBLCollection_Release(collection);
      Using the prediction() function reduces storage, because the encoded vectors are stored only in the index. However, indexing takes longer, because the vector embeddings are generated at run time.

      Create a Lazy Vector Index

      Lazy indexing is an alternative to using a predictive model with a regular vector index, which handles the indexing process automatically. Use lazy indexing when the ML model is not available locally on the device, or to create vector indexes without storing vector embeddings in the documents.

              // Creating a lazy vector index using the document's 'color' key.
              // The value of this key will be used to compute a vector when updating the index.
              CBLVectorIndexConfiguration config{};
              config.expressionLanguage = kCBLN1QLLanguage;
              config.expression = FLStr("color");
              config.dimensions = 3;
              config.centroids = 100;
              config.isLazy = true;

      You can enable lazy vector indexing by setting the isLazy property to true in your vector index configuration.

      Lazy Vector Indexing is opt-in functionality: the isLazy property is set to false by default.

      Updating the Lazy Index

      Below is an example of how you can update your lazy index.

              // Get the collection object
              CBLError err {};
              CBLCollection* collection =
                  CBLDatabase_Collection(database, FLStr("colors"), FLStr("_default"), &err);
              if (collection == nullptr) { throw std::domain_error("No Collection Found"); }
      
              // Get the index object
              CBLQueryIndex* index = CBLCollection_GetIndex(collection, FLStr("colors_index"), &err);
              if (!index) { throw std::domain_error("Index Not Found"); }
      
              while (true) {
                  // Start an update on it (in this case, limit to 50 entries at a time)
                  CBLIndexUpdater* updater = CBLQueryIndex_BeginUpdate(index, 50, &err);
                  if (!updater) {
                      if (err.code != 0) { throw std::domain_error("Error Begin Update"); }
                      // If updater is NULL and no error, that means there are no more entries to process
                      break;
                  }
      
                  for (size_t i = 0; i < CBLIndexUpdater_Count(updater); i++) {
                      // The value type will depend on the expression you have set in your index.
                      // In this example, it is a string property.
                      FLString value = FLValue_AsString(CBLIndexUpdater_Value(updater, i));
                      std::string colorString = std::string((char*)value.buf, value.size);
      
                      std::vector<float> vector;
                      try {
                          // Call a MLModel to get a vector.
                          vector = Color::getVector(colorString);
                      } catch (const TransientError& e) {
                          // Bad connection? Corrupted over the wire? Something bad happened
                          // and the vector cannot be generated at the moment. So skip
                          // this entry. The next time CBLQueryIndex_BeginUpdate is called,
                          // it will be considered again.
                          CBLIndexUpdater_SkipVector(updater, i);
                          // Move on so the skipped entry is not overwritten by SetVector below.
                          continue;
                      } catch (...) {
                          // An unexpected error happened.
                          CBLIndexUpdater_Release(updater);
                          throw std::domain_error("Error Getting a Vector");
                      }
      
                      bool success;
                      if (!vector.empty()) {
                          // The size of the vector must match the number of dimensions set in the index.
                          // Otherwise, an error will be returned.
                          success = CBLIndexUpdater_SetVector(updater, i, vector.data(), vector.size(), &err);
                      } else {
                          // No vector applicable. Calling SetVector with NULL will
                          // cause the underlying document to NOT be indexed
                          success = CBLIndexUpdater_SetVector(updater, i, nullptr, 0, &err);
                      }
                      if (!success) {
                          CBLIndexUpdater_Release(updater);
                          throw std::domain_error("Error Setting a Vector");
                      }
                  }
      
                  // This writes the vectors to the index. You MUST have either set or
                  // skipped all the values inside the updater or this call will return an error.
                  if (!CBLIndexUpdater_Finish(updater, &err)) {
                      CBLIndexUpdater_Release(updater);
                      throw std::domain_error("Error Finish Updating");
                  }
      
                  CBLIndexUpdater_Release(updater);
              }
      
              CBLQueryIndex_Release(index);

      You update the index in batches. Each call to CBLQueryIndex_BeginUpdate() returns an updater holding at most limit entries (50 in this example), and the loop repeats until no entries remain to be processed.

      The update process runs in the following sequence:

      1. Get a value from the updater.

        1. If the vector cannot be generated at the moment, handle it. In this example, the entry is skipped and is considered again the next time CBLQueryIndex_BeginUpdate() is called.

          A key benefit of lazy indexing is that the indexing process continues if a vector fails to generate. For standard vector indexing, this will cause the affected documents to be dropped from the indexing process.

      2. Set the vector that your ML model computed from the updater value.

        1. If there is no value for the vector, the underlying document is not indexed.

      3. Once every entry in the batch has been set or skipped, finish the update.

      CBLIndexUpdater_Finish() returns an error if any values inside the updater have not been set or skipped.
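      The set, skip, and finish rules can be illustrated with a small in-memory stand-in. This sketch does not use the Couchbase Lite API, and MockLazyIndex and everything in it are hypothetical. It only mirrors the behavior described above: a batch holds at most limit entries, finishing fails if any entry is still untouched, and skipped entries are offered again later.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for a lazy index; NOT the Couchbase Lite API.
class MockLazyIndex {
public:
    enum class State { Pending, Set, Skipped };

    struct Batch {
        std::vector<std::string> values;  // values to turn into vectors
        std::vector<State> states;        // one state per value
    };

    explicit MockLazyIndex(const std::vector<std::string>& values)
        : pending_(values.begin(), values.end()) {}

    // Hand out at most `limit` unindexed values. An empty batch means "done".
    Batch beginUpdate(std::size_t limit) {
        Batch batch;
        while (!pending_.empty() && batch.values.size() < limit) {
            batch.values.push_back(pending_.front());
            pending_.pop_front();
        }
        batch.states.assign(batch.values.size(), State::Pending);
        return batch;
    }

    // Commit a batch. Fails, changing nothing, if any entry is still Pending.
    // Skipped entries are queued again; Set entries count as indexed.
    bool finish(const Batch& batch) {
        for (State s : batch.states) {
            if (s == State::Pending) {
                pending_.insert(pending_.begin(), batch.values.begin(), batch.values.end());
                return false;
            }
        }
        for (std::size_t i = 0; i < batch.values.size(); ++i) {
            if (batch.states[i] == State::Skipped) pending_.push_back(batch.values[i]);
            else ++indexed_;
        }
        return true;
    }

    std::size_t indexedCount() const { return indexed_; }
    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<std::string> pending_;
    std::size_t indexed_ = 0;
};
```

      In this mock, an entry that is skipped on every pass keeps coming back, so a real update loop would usually bound its retries or schedule them for later.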

      Vector Search SQL++ Support

      Couchbase Lite currently supports Hybrid Vector Search and the APPROX_VECTOR_DISTANCE() function.

      Similar to the Full Text Search match() function, the APPROX_VECTOR_DISTANCE() function and Hybrid Vector Search cannot use the OR expression with the other expressions in the related WHERE clause.

      You can use Hybrid Vector Search (Hybrid Search) to perform vector search in conjunction with regular SQL++ queries. With Hybrid Search, you perform vector search on documents that have already been filtered based on criteria specified in the WHERE clause.

      A LIMIT clause is required for non-hybrid Vector Search. This avoids a slow, exhaustive, unlimited search of all possible vectors.
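      Conceptually, a hybrid query filters first and ranks second. The brute-force sketch below shows that order of operations on an in-memory list. It is an illustration only: Doc, l2Squared, and hybridSearch are hypothetical names, and a real vector index uses an approximate search rather than sorting every match.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical document shape used only for this illustration.
struct Doc {
    std::string id;
    float saturation;
    std::vector<float> vector;
};

// Squared Euclidean distance between two vectors of equal length.
float l2Squared(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Keep the documents that pass `where`, order them by distance to `target`,
// and return the ids of the closest `limit` documents.
std::vector<std::string> hybridSearch(const std::vector<Doc>& docs,
                                      const std::function<bool(const Doc&)>& where,
                                      const std::vector<float>& target,
                                      std::size_t limit) {
    std::vector<const Doc*> matches;
    for (const Doc& doc : docs) {
        if (where(doc)) matches.push_back(&doc);   // WHERE clause
    }
    std::sort(matches.begin(), matches.end(),      // ORDER BY distance
              [&](const Doc* l, const Doc* r) {
                  return l2Squared(l->vector, target) < l2Squared(r->vector, target);
              });
    if (matches.size() > limit) matches.resize(limit);  // LIMIT
    std::vector<std::string> ids;
    for (const Doc* doc : matches) ids.push_back(doc->id);
    return ids;
}
```

      A document that is very close to the target vector is still excluded when it fails the filter, because the distance ranking only sees documents that passed the WHERE clause.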

      Hybrid Vector Search with Full Text Match

      Below is an example of using Hybrid Search with the Full Text match() function.

              // Create a hybrid vector search query by using ORDER BY and WHERE clause.
              CBLError err{};
              const char* sql = "SELECT meta().id, color "
                                "FROM _default.colors "
                                "WHERE MATCH(color_desc_index, $text) "
                                "ORDER BY approx_vector_distance(vector, $vector) "
                                "LIMIT 8";
      
              CBLQuery* query = CBLDatabase_CreateQuery(database, kCBLN1QLLanguage,
                                                        FLStr(sql),
                                                        nullptr, &err);
      
              // Use ML model to get a vector (an array of floats) for the input color.
              std::vector<float> colorVector = Color::getVector("FF00AA");
      
              // Copy the vector into a Fleece array.
              auto colorArray = FLMutableArray_New();
              for (auto val : colorVector) {
                  FLMutableArray_AppendFloat(colorArray, val);
              }
      
              // Build the query parameters. Keys are the parameter names without the "$" prefix.
              auto params = FLMutableDict_New();
              // Set the vector array to the parameter "$vector".
              FLMutableDict_SetArray(params, FLStr("vector"), colorArray);
              // Set the search text to the parameter "$text".
              FLMutableDict_SetString(params, FLStr("text"), FLStr("vibrant"));
              CBLQuery_SetParameters(query, params);
      
              FLMutableArray_Release(colorArray);
              FLMutableDict_Release(params);
      
              // Execute the query:
              auto results = CBLQuery_Execute(query, &err);
              if (!results) {
                  throw std::domain_error("Invalid Query");
              }
      
              while(CBLResultSet_Next(results)) {
                  // Process result
              }

      Below is an example of using Hybrid Search with vectors generated by the prediction() function at index time.

              // Create a hybrid vector search query by using ORDER BY and WHERE clause.
              CBLError err{};
              const char* sql =
              "SELECT meta().id, color "
              "FROM _default.colors "
              "WHERE saturation > 0.5 "
              "ORDER BY approx_vector_distance(prediction(ColorModel, {\"colorInput\": color}).vector, $vector) "
              "LIMIT 8";
      
              CBLQuery* query = CBLDatabase_CreateQuery(database, kCBLN1QLLanguage,
                                                        FLStr(sql),
                                                        nullptr, &err);
      
              // Use ML model to get a vector (an array of floats) for the input color.
              std::vector<float> colorVector = Color::getVector("FF00AA");
      
              // Copy the vector into a Fleece array.
              auto colorArray = FLMutableArray_New();
              for (auto val : colorVector) {
                  FLMutableArray_AppendFloat(colorArray, val);
              }
      
              // Set the vector array to the parameter "$vector".
              auto params = FLMutableDict_New();
              FLMutableDict_SetArray(params, FLSTR("vector"), colorArray);
              CBLQuery_SetParameters(query, params);
      
              FLMutableArray_Release(colorArray);
              FLMutableDict_Release(params);
      
              // Execute the query:
              auto results = CBLQuery_Execute(query, &err);
              if (!results) {
                  throw std::domain_error("Invalid Query");
              }
      
              while(CBLResultSet_Next(results)) {
                  // Process result
              }

      APPROX_VECTOR_DISTANCE(vector-expr, target-vector, [metric], [nprobes], [accurate])

      If you use a different distance metric in the APPROX_VECTOR_DISTANCE() function from the one configured in the index, you will receive an error when compiling the query.

      Each entry gives the Parameter, followed by Is Required and Description.

      vector-expr
        Is Required: yes
        Description: The expression returning a vector (NOT the index name). Must match the expression specified in the vector index exactly.

      target-vector
        Is Required: yes
        Description: The target vector.

      metric
        Is Required: no
        Description: Values: "EUCLIDEAN_SQUARED", "L2_SQUARED", "EUCLIDEAN", "L2", "COSINE", "DOT". If not specified, the metric set in the vector index is used. If specified, the metric must match the metric set in the vector index. This optional parameter allows multiple indexes to be attached to the same field in a document.

      nprobes
        Is Required: no
        Description: Number of buckets to search for the nearby vectors. If not specified, the nprobes set in the vector index is used.

      accurate
        Is Required: no
        Description: If not present, false will be used, which means that the quantized/encoded vectors in the index will be used for calculating the distance.

      IMPORTANT: Only accurate = false is supported
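      Because distances are calculated from the encoded vectors, they can differ slightly from distances computed on the original floats. The sketch below illustrates the general idea of 8-bit scalar quantization. It is not Couchbase Lite's encoder: quantize8, dequantize8, and the fixed value range [lo, hi] are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Encode each component as one byte by mapping the range [lo, hi] onto 0..255.
// Values outside the range are clamped to it.
std::vector<std::uint8_t> quantize8(const std::vector<float>& v, float lo, float hi) {
    assert(hi > lo);
    std::vector<std::uint8_t> out;
    out.reserve(v.size());
    for (float x : v) {
        const float clamped = std::min(std::max(x, lo), hi);
        const float scaled = (clamped - lo) / (hi - lo) * 255.0f;
        out.push_back(static_cast<std::uint8_t>(std::lround(scaled)));
    }
    return out;
}

// Decode bytes back to floats. The result approximates the original values.
std::vector<float> dequantize8(const std::vector<std::uint8_t>& q, float lo, float hi) {
    assert(hi > lo);
    std::vector<float> out;
    out.reserve(q.size());
    for (std::uint8_t b : q) {
        out.push_back(lo + (hi - lo) * (static_cast<float>(b) / 255.0f));
    }
    return out;
}
```

      Each component shrinks from a 4-byte float to a single byte, and the round trip moves each in-range value by at most about one quantization step, (hi - lo) / 255. That small per-component error is why the resulting distances are approximate.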

      Use APPROX_VECTOR_DISTANCE()

              // Create a query by using the approx_vector_distance() in the WHERE clause.
              CBLError err{};
              const char* sql = "SELECT id, color "
                                "FROM _default.colors "
                                "WHERE approx_vector_distance(vector, $vector) < 0.5 "
                                "LIMIT 8";
              
              CBLQuery* query = CBLDatabase_CreateQuery(database, kCBLN1QLLanguage,
                                                        FLStr(sql),
                                                        nullptr, &err);
      
              // Use ML model to get a vector (an array of floats) for the input color.
              std::vector<float> colorVector = Color::getVector("FF00AA");
      
              // Copy the vector into a Fleece array.
              auto colorArray = FLMutableArray_New();
              for (auto val : colorVector) {
                  FLMutableArray_AppendFloat(colorArray, val);
              }
      
              // Set the vector array to the parameter "$vector".
              auto params = FLMutableDict_New();
              FLMutableDict_SetArray(params, FLSTR("vector"), colorArray);
              CBLQuery_SetParameters(query, params);
      
              FLMutableArray_Release(colorArray);
              FLMutableDict_Release(params);
      
              // Execute the query:
              auto results = CBLQuery_Execute(query, &err);
              if (!results) {
                  throw std::domain_error("Invalid Query");
              }
      
              while(CBLResultSet_Next(results)) {
                  // Process result
              }

      APPROX_VECTOR_DISTANCE() returns the approximate distance between the target vector, typically generated by your ML model, and the vector of each document that the query considers. The LIMIT clause caps the number of results that the query returns.
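      For reference, the distance metrics named on this page can be written out as plain functions. This is a sketch of the exact formulas on unencoded vectors, assuming the definitions given in Table 1 (cosine as 1 minus cosine similarity, dot as the negated dot product). The function names are hypothetical, and the index itself computes approximate values from encoded vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

float dotProduct(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Squared Euclidean distance: sum of squared component differences.
float euclideanSquared(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Euclidean distance: square root of the squared distance.
float euclidean(const Vec& a, const Vec& b) {
    return std::sqrt(euclideanSquared(a, b));
}

// Cosine distance: 1 - cosine similarity. Both vectors must be non-zero.
float cosineDistance(const Vec& a, const Vec& b) {
    const float normA = std::sqrt(dotProduct(a, a));
    const float normB = std::sqrt(dotProduct(b, b));
    assert(normA > 0.0f && normB > 0.0f);
    return 1.0f - dotProduct(a, b) / (normA * normB);
}

// Dot distance: negated dot product, so a larger dot product ranks closer.
float dotDistance(const Vec& a, const Vec& b) {
    return -dotProduct(a, b);
}
```

      With all four metrics a smaller value means a closer match, which is why the queries on this page sort in ascending order and compare against an upper bound such as < 0.5.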

      Prediction with APPROX_VECTOR_DISTANCE()

      Below is an example of using APPROX_VECTOR_DISTANCE() with vectors generated by the prediction() function at index time.

              // Create a vector search query that uses prediction() for computing vectors.
              CBLError err{};
              const char* sql = "SELECT id, color "
                                "FROM _default.colors "
                                "ORDER BY approx_vector_distance(prediction(ColorModel, {\"colorInput\": color}).vector, $vector) "
                                "LIMIT 8";
      
              CBLQuery* query = CBLDatabase_CreateQuery(database, kCBLN1QLLanguage,
                                                        FLStr(sql),
                                                        nullptr, &err);
      
              // Use ML model to get a vector (an array of floats) for the input color.
              std::vector<float> colorVector = Color::getVector("FF00AA");
      
              // Copy the vector into a Fleece array.
              auto colorArray = FLMutableArray_New();
              for (auto val : colorVector) {
                  FLMutableArray_AppendFloat(colorArray, val);
              }
      
              // Set the vector array to the parameter "$vector".
              auto params = FLMutableDict_New();
              FLMutableDict_SetArray(params, FLSTR("vector"), colorArray);
              CBLQuery_SetParameters(query, params);
      
              FLMutableArray_Release(colorArray);
              FLMutableDict_Release(params);
      
              // Execute the query:
              auto results = CBLQuery_Execute(query, &err);
              if (!results) {
                  throw std::domain_error("Invalid Query");
              }
      
              while(CBLResultSet_Next(results)) {
                  // Process result
              }