CREATE VECTOR INDEX

  • reference
  • Couchbase Server 8.0
    +

    The CREATE VECTOR INDEX statement allows you to create Hyperscale Vector indexes.

    To create secondary indexes or Composite Vector indexes, use the CREATE INDEX statement.

    Purpose

    CREATE VECTOR INDEX allows you to make multiple concurrent index creation requests. The command starts a task to create the index definition in the background. If there is an index creation task already running, the Index Service queues the incoming index creation request. CREATE VECTOR INDEX returns as soon as the index creation phase is complete.

    By default, when the index creation phase is complete, the Index Service triggers the index build phase. If you lose connectivity, the index build operation continues in the background. You can defer the index build phase using the defer_build clause. In deferred build mode, CREATE VECTOR INDEX creates the index definition, but does not trigger the index build phase. You can then build the index using the BUILD INDEX command.

    You can create multiple identical secondary indexes on a keyspace and place them on separate nodes for better index availability. In Couchbase Server Enterprise Edition, the recommended way to do this is using the num_replica option. In Couchbase Server Community Edition, you need to create multiple identical indexes and place them using the nodes option. See WITH Clause for more details.

    Hyperscale Vector indexes and Composite Vector indexes require a codebook for the vector field. The codebook is the result of sampling the dataset and is saved as part of the index metadata.

    The codebook is created as part of the BUILD INDEX process, and is not incrementally updated. If the data set changes dramatically, you must drop and rebuild the index to update the codebook.

    Prerequisites

    RBAC Privileges

    To execute the CREATE VECTOR INDEX statement, you must have the Query Manage Index privilege granted on the keyspace. For more information about user roles, see Authorization.

    Syntax

    create-vector-index ::= 'CREATE' 'VECTOR' 'INDEX' ( index-name ( 'IF' 'NOT' 'EXISTS' )? |
                            'IF' 'NOT' 'EXISTS' index-name ) 'ON' keyspace-ref
                            '(' index-key-and-attrib ')'
                            index-include? index-partition? where-clause? index-using? index-with?
    Syntax diagram: see source code listing
    index-name

    (Required) A unique name that identifies the index.

    Valid GSI index names can contain any of the following characters: A-Z a-z 0-9 # _, and must start with a letter, [A-Z a-z]. The minimum length of an index name is 1 character and there is no maximum length set for an index name. When querying, if the index name contains a # or _ character, you must enclose the index name within backticks.

    keyspace-ref

    (Required) Specifies the keyspace where the index is created. See Keyspace Reference.

    index-key-and-attrib

    (Required) Specifies the index key and index key attribute. See Index Key and Attribute.

    index-include

    (Optional) Specifies non-key fields to include in the index. See INCLUDE Clause.

    index-partition

    (Optional) Specifies index partitions. See PARTITION BY HASH Clause.

    where-clause

    (Optional) Specifies filters for a partial index. See WHERE Clause.

    index-using

    (Optional) Specifies the index type. See USING Clause.

    index-with

    (Optional) Specifies options for the index. See WITH Clause.

    IF NOT EXISTS Clause

    The optional IF NOT EXISTS clause enables the statement to complete successfully when the specified index already exists. If an index with the same name already exists within the specified keyspace, then:

    • If this clause is not present, an error is generated.

    • If this clause is present, the statement does nothing and completes without error.

    Keyspace Reference

    keyspace-ref ::= keyspace-path | keyspace-partial
    Syntax diagram: see source code listing

    Specifies the keyspace for which the index needs to be created. The keyspace reference may be a keyspace path or a keyspace partial.

    If there is a hyphen (-) inside any part of the keyspace reference, you must wrap that part of the keyspace reference in backticks (` `). See the examples on this page.

    Keyspace Path

    keyspace-path ::= ( namespace ':' )? bucket ( '.' scope '.' collection )?
    Syntax diagram: see source code listing

    If the keyspace is a named collection, or the default collection in the default scope within a bucket, the keyspace reference may be a keyspace path. In this case, the query context should not be set.

    namespace

    (Optional) An identifier that refers to the namespace of the keyspace. Currently, only the default namespace is available. If the namespace name is omitted, the default namespace in the current session is used.

    bucket

    (Required) An identifier that refers to the bucket name of the keyspace.

    scope

    (Optional) An identifier that refers to the scope name of the keyspace. If omitted, the bucket’s default scope is used.

    collection

    (Optional) An identifier that refers to the collection name of the keyspace. If omitted, the default collection in the bucket’s default scope is used.

    For example, default:`travel-sample` indicates the default collection in the default scope in the travel-sample bucket in the default namespace.

    Similarly, default:`travel-sample`.inventory.airline indicates the airline collection in the inventory scope in the travel-sample bucket in the default namespace.

    Keyspace Partial

    keyspace-partial ::= collection
    Syntax diagram: see source code listing

    Alternatively, if the keyspace is a named collection, the keyspace reference may be just the collection name with no path. In this case, you must set the query context to indicate the required namespace, bucket, and scope.

    collection

    (Required) An identifier that refers to the collection name of the keyspace.

    For example, airline indicates the airline collection, assuming the query context is set.

    Index Key and Attribute

    index-key-and-attrib ::= index-key index-vector
    Syntax diagram: see source code listing

    Hyperscale Vector indexes only have one key, which must be a vector field. The index key takes one attribute, the VECTOR keyword.

    index-key

    (Required) Specifies an index key. See Index Key.

    index-vector

    (Required) Specifies an attribute for the index key. See VECTOR Keyword.

    Index Key

    index-key ::= expr | array-expr
    Syntax diagram: see source code listing

    The index key is a SQL++ expression referring to a vector field, or an ARRAY expression on the vector field.

    expr

    The name of a vector field in the document, or a BASE64_DECODE() function on the vector field — this is necessary if the embedded vectors are stored as a base64-encoded string.

    array-expr

    An array expression on a vector field in the document. Only ALL ARRAY is supported. The FLATTEN_KEYS() function is supported, but more than one key in FLATTEN_KEYS() is not permitted. For details, see Array Indexing.

    VECTOR Keyword

    index-vector ::= 'VECTOR'
    Syntax diagram: see source code listing

    Indicates that the index key is a vector field.

    INCLUDE Clause

    index-include ::= 'INCLUDE' '(' expr ( ',' expr )* ')'
    Syntax diagram: see source code listing

    Used to include scalar fields in the index, which you can use to filter the vector search. The INCLUDE clause cannot include a vector field. For details, see Use Scalar Columns to Filter Hyperscale Vector Index Scans.

    expr

    A SQL++ expression referring to any scalar field in the document.

    PARTITION BY HASH Clause

    Used to partition the index. Index partitioning helps increase the query performance by dividing and spreading a large index of documents across multiple nodes, horizontally scaling out an index as needed. For details, see Index Partitioning.

    With Hyperscale Vector indexes and Composite Vector indexes, training is done for each index node independently, and the codebook is provided to all partitions on that node. If there are multiple partitions for an index on a node, training is only done once for all partitions. See The Importance of Index Training.

    WHERE Clause

    where-clause ::= 'WHERE' cond
    Syntax diagram: see source code listing
    cond

    Specifies WHERE clause predicates to qualify the subset of documents to include in the index.

    USING Clause

    index-using ::= 'USING' 'GSI'
    Syntax diagram: see source code listing

    The index type for a secondary index must be Global Secondary Index (GSI). The USING GSI keywords are optional and may be omitted.

    WITH Clause

    index-with ::= 'WITH' expr
    Syntax diagram: see source code listing

    Use the WITH clause to specify additional options.

    expr

    An object with the following properties.

    Name Description Schema

    nodes
    optional

    An array of strings, each of which represents a node name.


    In Couchbase Server Community Edition, a single global secondary index can be placed on a single node that runs the indexing service. The nodes property allows you to specify the node that the index is placed on. If nodes is not specified, one of the nodes running the indexing service is randomly picked for the index.


    In Couchbase Server Enterprise Edition, you can specify multiple nodes to distribute replicas of an index across nodes running the indexing service: for example, WITH {"nodes": ["node1:8091", "node2:8091", "node3:8091"]}. For details and examples, see Index Replication.

    If specifying both nodes and num_replica, the number of nodes in the array must be one greater than the specified number of replicas otherwise the index creation will fail.

    If nodes is not specified, then the system chooses nodes on which to place the new index and any replicas, in order to achieve the best resource utilization across nodes running the indexing service. This is done by taking into account the current resource usage statistics of index nodes.


    A node name passed to the nodes property must include the cluster administration port, by default 8091.

    Example: ["192.0.2.0:8091"]

    String array

    defer_build
    optional

    Whether the index should be created in deferred build mode.

    When set to true, the CREATE INDEX operation queues the task for building the GSI index but immediately pauses the building of the index. Index building requires an expensive scan operation. Deferring building of the index with multiple indexes can optimize the expensive scan operation. Admins can defer building multiple indexes and, using the BUILD INDEX statement, build multiple indexes efficiently with one efficient scan of bucket data.

    When set to false, the CREATE INDEX operation queues the task for building the GSI index and immediately kicks off the building of the index.

    Default: false

    Boolean

    num_replica
    optional

    This property is only available in Couchbase Server Enterprise Edition.

    The number of replicas of the index to create.

    The indexer will automatically distribute these replicas amongst index nodes in the cluster for load-balancing and high availability purposes. The indexer will attempt to distribute the replicas based on the server groups in use in the cluster where possible.

    If the value of this property is not less than the number of index nodes in the cluster, then the index creation will fail.

    Default: 1

    Integer

    dimension
    required

    The number of dimensions in the vector. The embedded model you use to embed the vectors determines the number of dimensions in the vector.

    Integer

    similarity
    optional

    Sets the distance metric to use when comparing vectors during index creation.

    Couchbase Server uses the following strings to represent the distance metrics:

    COSINE

    Cosine Similarity

    DOT

    Dot Product

    L2
    EUCLIDEAN

    Euclidean Distance

    L2_SQUARED
    EUCLIDEAN_SQUARED

    Euclidean Squared Distance

    For the greatest accuracy, use the distance metric you plan to use to query the data.

    Default: L2_SQUARED

    String

    description
    optional

    The settings for the quantization and index algorithms. The string is made up of the following settings:

    IVF

    The number of centroids allocated for the index.

    SQ

    For scalar quantization — the number of bits used to store the centroid for each bin.

    PQ

    For product quantization — the number of subquantizers, and the number of bits in the centroid’s index value.

    For more information, see Quantization and Centroid Settings.

    Pattern: ^IVF[0-9]*,(SQ[468]|PQ[0-9]+x[0-9]+)$
    Default: IVF,SQ8

    String

    scan_nprobes
    optional

    The number of cells to search for each scan.

    Default: 1

    Integer

    train_list
    optional

    The size of the sample set of vectors to be used for index training.

    If the index count is < 10000, the default is to sample everything.

    Otherwise, the default value is 10% of the index count, or 10 × the number of centroids, whichever is higher.

    Maximum: 1000000

    Integer

    persist_full_vector
    optional

    If true (the default), the index stores the full vector value in addition to the quantized value. This significantly increases the size of the index.

    Set to false if you do not plan to use reranking. For more information, see Hyperscale Vector Index Reranking and Full Vector Persistence.

    Default: true

    Boolean

    Partitioned indexes support further options. See Index Partitioning.

    Usage

    Attention

    Do not create (or drop) secondary indexes, Composite Vector indexes, or Hyperscale Vector indexes when any Index service node is down, as this may result in duplicate index names.

    Monitoring Indexes

    Index metadata provides a state field. This state field and other index metadata can be queried using system:indexes. The index state may be scheduled for creation, deferred, building, pending, online, offline, or abridged. You can also monitor the index state using the Couchbase Web Console.

    If you kick off multiple index creation operations concurrently, you may sometimes see transient errors similar to the following. If this error occurs, the Index Service tries to run the failed operation again in the background until it succeeds, up to a maximum of 1000 retries.

    [
      {
        "code": 5000,
        "msg": "GSI CreateIndex() - cause: Encountered transient error.  Index creation will be retried in background.  Error: Index ... will retry building in the background for reason: Build Already In Progress. Keyspace ...",
        "query": "..."
      }
    ]

    If the Index Service still cannot create the index after the maximum number of retries, the index state is marked as offline. You must drop the failed index using the DROP INDEX command.

    Index Replicas

    In the Indexes screen in the Couchbase Web Console, index replicas are marked with their replica ID.

    The Indexes screen showing an index and index replica with replica ID

    If you select view by server node from the drop-down menu, you can see the server node where each index and index replica is placed.

    You can also query the system:indexes catalog to find the ID of an index replica and see which node it’s placed on.

    By default, index replicas are used to serve index scans. The system automatically load-balances an index scan across the index and all its replicas. Adding index replicas enables you to scale scan throughput, in addition to providing high availability.

    With Hyperscale Vector indexes and Composite Vector indexes, training is done by each replica index independently, and the codebook is stored as part of index metadata. See The Importance of Index Training.

    Defer Index Builds by Default

    Usually, the default setting for the defer_build option is false. In Couchbase Server 7.6.2 and later, you can change the default setting for the defer_build option.

    If you change the default setting for defer_build to true, index creation operates in deferred build mode by default.

    To change the default setting for deferred builds, use the Index Settings REST API to set the indexer.settings.defer_build property. For an example, see Defer Index Builds by Default.

    Examples

    To try the examples in this section, you must install the vector sample data as described in Prerequisites.

    Example 1. Create a Hyperscale Vector index

    For this example, the path to the required keyspace is specified by the query, so you do not need to set the query context.

    Create a Hyperscale Vector index for the vector column named embedding-vector-dot.

    CREATE VECTOR INDEX `color_desc_hyperscale` 
           ON `vector-sample`.`color`.`rgb`(`embedding_vector_dot` VECTOR)
           WITH { "dimension":1536, "similarity":"L2", "description":"IVF8,SQ4" }
    Example 2. Create a Hyperscale Vector index with included scalar values

    For this example, the path to the required keyspace is specified by the query, so you do not need to set the query context.

    Create a Hyperscale Vector index for the vector column named embedding-vector-dot, including the scalar brightness field.

    CREATE VECTOR INDEX `color_desc_hyperscale_brightness` 
           ON `vector-sample`.`color`.`rgb`(`embedding_vector_dot` VECTOR)
           INCLUDE (`brightness`)
           WITH { "dimension":1536, "similarity":"L2", "description":"IVF8,SQ4" }
    Example 3. Create a Hyperscale Vector index with no reranking

    For this example, the path to the required keyspace is specified by the query, so you do not need to set the query context.

    Create a Hyperscale Vector index from the example RGB dataset that does not persist the full vector value.

    CREATE VECTOR INDEX `color_desc_hyperscale_no_persist` 
           ON `vector-sample`.`color`.`rgb`(`embedding_vector_dot` VECTOR)
           WITH { "dimension":1536, "similarity":"L2", "description":"IVF8,SQ4", 
                  "persist_full_vector": false};