Adaptive Index

Adaptive Indexes are a special type of GSI array index that can index all or specified fields of a document. Such an index is generic in nature, and it can efficiently index and lookup any of the index-key values. This enables efficient ad hoc queries (that may have WHERE clause predicates on any of the index-key fields) without requiring to create various composite indexes for different combinations of fields. Adaptive Index is a functional array index created using the N1QL function PAIRS().

Basically, the idea is to be able to simply load data and start querying:

  • using a single secondary index, and

  • not worrying about creating appropriate secondary indexes for each query.

Note that without Adaptive Indexes:

  • Only primary index can help run any ad hoc query. But using primary index can be expensive for queries with predicates on any of the non-key fields of the document.

  • Each query will need a compatible secondary index that can qualify for the predicates in the WHERE clause. See section Contrast with Composite Indexes for details.

For instance, consider a user profile or hotel reservation search use case. A person’s profile may need to be searched based on any of the personal attributes such as first name, last name, age, city, address, job, title, company, etc. Similarly, a hotel room availability may be searched based on wide criteria, such as room facilities, amenities, price, and other features. In this scenario, traditional secondary indexes or composite indexes can’t be used effectively (see section Contrast with Composite Indexes to understand some of the concerns). Adaptive indexes can help effectively and efficiently run such ad hoc search queries.

Syntax

create-index ::= CREATE INDEX index-name ON named-keyspace-ref '(' index-key [ ',' index-key ]* ')' [ where-clause ] [ index-using ] [ index-with ]
'CREATE' 'INDEX' index-name 'ON' named-keyspace-ref '(' index-key ( ',' index-key )* ')' where-clause? index-using? index-with?

An adaptive index is a type of array index. To create an adaptive index, the overall syntax is the same as for an array index.

index-name

Specify a unique name to identify the index.

Named Keyspace Reference

named-keyspace-ref ::= [ namespace ':' ] keyspace
( namespace ':' )? keyspace

Specify the name of the keyspace or bucket to create an index on.

Index Key

index-key ::= ( ALL | DISTINCT ) pairs-function
( 'ALL' | 'DISTINCT' ) pairs-function

To create an adaptive index, the index key must be a simple array expression containing a PAIRS() function.

PAIRS() Function

pairs-function ::= PAIRS '(' SELF | index-key-object ')'
'PAIRS' '(' ( 'SELF' | index-key-object ) ')'

When the SELF keyword is used, the adaptive index is created with all fields in the documents of the keyspace.

If you want to create an adaptive index on selected fields only, you must specify an index key object.

Index Key Object

index-key-object ::= '{' ['"' name '"' ':'] expr [',' ['"' name '"' ':'] expr ]* '}'
'{' ('"' name '"' ':')? expr (',' ('"' name '"' ':')? expr )* '}'

This is an object of name-value pairs of the document fields that should be indexed.

name

The field name that corresponds to expr.

expr

A N1QL expression that is allowed in CREATE INDEX. This must be an expression over any document fields.

In a simplified form, the names in the object can be omitted, in which case the corresponding names will be same as expressions, and the expressions must be field names in the document that are being indexed.

When using PAIRS() with an OBJECT construction, you need to keep in mind:

  • {a, c.a} – when evaluated, both will inherit the same name of a causing one value to overwrite the other. Both values will not be indexed. A better way to handle this is to name one with an alias explicitly, such as {a, "ca":c.a}.

  • {abs(a)} – name of the object field is null and will raise an error. A better way to handle this is to use an alias explicitly, such as {"abs_a":abs(a)}.

WHERE Clause

where-clause ::= WHERE cond
'WHERE' cond
cond

Specifies WHERE clause predicates to qualify the subset of documents to include in the index.

USING Clause

index-using ::= USING ( VIEW | GSI )
'USING' ( 'VIEW' | 'GSI' )

The USING clause specifies the index type to use. To create an adaptive index, the index type must be GSI; as this is the default, the USING clause may be omitted.

WITH Clause

index-with ::= WITH expr
'WITH' expr

Use the WITH clause to specify additional options.

expr

An object specifying additional options for the query.

See the CREATE INDEX statement for more details on the syntax.

Examples

For example, consider the travel-sample data that is shipped with the product and the following indexes. Sample buckets can be installed as explained here.

C1
CREATE INDEX `idx_airportname` ON `travel-sample`(`airportname`) WHERE type = "airport";
C2
CREATE INDEX `idx_city` ON `travel-sample`(`city`) WHERE type = "airport";
C3
CREATE INDEX `idx_faa` ON `travel-sample`(`faa`) WHERE type = "airport";

Here, three different indexes need to be created to help different queries whose WHERE clause predicates may refer to different fields. For instance, the following queries Q1, Q2, and Q3 will use the indexes created in C1, C2, and C3, respectively:

Q1
SELECT * FROM `travel-sample` WHERE airportname LIKE "San Francisco%" AND type = "airport";
Q2
SELECT * FROM `travel-sample` WHERE city = "San Francisco" AND type = "airport";
Q3
SELECT * FROM `travel-sample` WHERE faa = "SFO" AND type = "airport";

However, the following single adaptive index C4 can serve all three of the following queries Q1A, Q2A, and Q3A:

C4
CREATE INDEX `ai_airport_day_faa`
ON `travel-sample`(DISTINCT PAIRS({airportname, city, faa, type}))
WHERE type = "airport";
Q1A
SELECT * FROM `travel-sample`
USE INDEX (ai_airport_day_faa)
WHERE airportname LIKE "San Francisco%" AND type = "airport";
Q2A
SELECT * FROM `travel-sample`
USE INDEX (ai_airport_day_faa)
WHERE city = "San Francisco" AND type = "airport";
Q3A
SELECT * FROM `travel-sample`
USE INDEX (ai_airport_day_faa)
WHERE faa = "SFO" AND type = "airport";

Similarly, the following adaptive index over SELF in C5 is also qualified for these queries. In fact, an adaptive index that includes all fields in the documents can serve any query on the travel-sample keyspace, though it might have different performance characteristics when compared to specific indexes created for a particular query. See the section Performance Implications for details. For example, the following queries Q5 and Q5A show how the generic adaptive index C5 is used to query predicates on different fields of the "airport" documents.

C5
CREATE INDEX `ai_self`
ON `travel-sample`(DISTINCT PAIRS(self))
WHERE type = "airport";
Q5
EXPLAIN SELECT * FROM `travel-sample`
USE INDEX (ai_self)
WHERE faa = "SFO" AND (`type` = "airport");
Result
{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IntersectScan",
        "scans": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan3",
              "index": "ai_self",
              "index_id": "5b3b131cf32ed25",
              "index_projection": {
                "primary_key": true
              },
              "keyspace": "travel-sample",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "[\"faa\", \"SFO\"]",
                      "inclusion": 3,
                      "low": "[\"faa\", \"SFO\"]"
                    }
                  ]
                }
              ],
              "using": "gsi"
            }
...
Q5A
EXPLAIN SELECT *
FROM `travel-sample`
USE INDEX (ai_self)
WHERE tz = "Europe/Paris"
AND (`type` = "airport");
Result
{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IntersectScan",
        "scans": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan3",
              "index": "ai_self",
              "index_id": "5b3b131cf32ed25",
              "index_projection": {
                "primary_key": true
              },
              "keyspace": "travel-sample",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "[\"tz\", \"Europe/Paris\"]",
                      "inclusion": 3,
                      "low": "[\"tz\", \"Europe/Paris\"]"
                    }
                  ]
                }
              ],
              "using": "gsi"
            }
...

Contrast with Composite Indexes

Traditionally, composite secondary indexes are used to create indexes with multiple index keys. For example, consider the following index in C6:

C6
CREATE INDEX `def_city_faa_airport`
ON `travel-sample`(city, faa, airportname)
WHERE (`type` = "airport");

Such composite indexes are very different from the adaptive index in C4 in many ways:

  1. Order of index keys is vital for composite indexes. When an index key is used in the WHERE clause, all prefixing index keys in the index definition must also be specified in the WHERE clause. For example, to use the index C6, a query to "find details of airports with FAA code SFO", must specify the prefixing index key city also in the WHERE clause just to qualify the index C6. Contrast the following query Q6 with Q3 above that uses the adaptive index in C3.

    Q6
    SELECT * FROM `travel-sample`
    WHERE faa = "SFO"
    AND city IS NOT MISSING;

    The problem is not just the addition of an extraneous predicate, but the performance. The predicate on the first index key city IS NOT MISSING is highly selective (i.e. most of the index entries in the index will match it) and hence, it will result in almost a full index scan.

  2. Complication in Queries. If a document has many fields to index, then the composite index will end up with all those fields as index keys. Subsequently, queries that only need to use index keys farther in the index key order will need many unnecessary predicates referring to all the preceding index keys. For example, if the index is:

    CREATE INDEX idx_name ON `travel-sample`(field1, field2, ..., field9);

    A query that has a predicate on field9 will get unnecessarily complicated, as it needs to use all preceding index keys from field1 to field8.

  3. Explosion of number of indexes for ad hoc queries. At some point, it becomes highly unnatural and overly complicated to write ad hoc queries using composite indexes. For instance, consider a user profile or inventory search use case where a person or item may need to be searched based on many criteria.

    One approach is to create indexes on all possible attributes. If that query can include any of the attributes, then it may require creation of innumerable indexes. For example, a modest 20 attributes will result in 20 factorial (2.43×1018) indexes in order to consider all combinations of sort orders of the 20 attributes.

Partial Adaptive Indexes

An adaptive index may also be a partial index, as can be seen by examining examples C4 and C5. For a partial adaptive index, you must ensure that any fields filtered by the WHERE clause in the index definition are also referenced by the PAIRS() function.

For example, the following query Q7 cannot select the index defined in C7A.

C7A
CREATE INDEX ai_geo ON `travel-sample`(DISTINCT PAIRS({geo.alt, geo.lat, geo.lon})) WHERE type = "airport"; (1)
Q7
EXPLAIN SELECT META(t).id FROM `travel-sample` t WHERE t.geo.alt > 1000 AND t.type = "airport";
Result
[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IndexScan3",
          "as": "t",
          "index": "def_type", (2)
          "index_id": "896b653092c61512",
...
1 The WHERE clause filters on type, but the PAIRS() function does not include the type field.
2 The query does not use the incorrectly-defined partial adaptive index.

However, the same query Q7 does select the partial adaptive index defined in C7B.

C7B
CREATE INDEX ai_geo_type ON `travel-sample`(DISTINCT PAIRS({geo.alt, geo.lat, geo.lon, type})) WHERE type = "airport"; (1)
Result
[
  {
    "plan": {
      "#operator": "Sequence",
      "~children": [
        {
          "#operator": "IntersectScan",
          "scans": [
            {
              "#operator": "IndexScan3",
              "as": "t",
              "index": "def_type",
              "index_id": "896b653092c61512",
              "index_projection": {
                "primary_key": true
              },
              "keyspace": "travel-sample",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "\"airport\"",
                      "inclusion": 3,
                      "low": "\"airport\""
                    }
                  ]
                }
              ],
              "using": "gsi"
            },
            {
              "#operator": "DistinctScan",
              "scan": {
                "#operator": "IndexScan3",
                "as": "t",
                "index": "ai_geo_type", (2)
                "index_id": "228048b17f72726b",
                "index_projection": {
                  "primary_key": true
                },
...
1 The WHERE clause filters on type, and the PAIRS() function includes the type field.
2 The query does an IntersectScan, including the correct partial adaptive index.

Alternatively, you can use the SELF keyword to ensure that the fields used in the WHERE clause are included in the PAIRS() function. Refer to C5 for an example.

An IntersectScan does not eliminate redundant queries, and this may impact performance. Refer to Performance Implications for details.

Performance Implications

While Adaptive Indexes are very useful, there are performance implications you need to keep in mind:

  1. If a query is not covered by a regular index, then an unnested index will not have any elimination of redundant indexes; and it will instead do an IntersectScan on all the indexes, which can impact performance.

    CREATE INDEX idx_name ON `travel-sample`(name); (1)
    CREATE INDEX idx_self ON `travel-sample`(DISTINCT PAIRS(self)); (2)
    EXPLAIN SELECT * FROM `travel-sample` WHERE name IS NOT NULL;
    Results
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IntersectScan", (3)
            "scans": [
    ...
    1 Index on the name field.
    2 Adaptive index on the whole document.
    3 IntersectScan of idx_name AND idx_self.

    Here’s another example with a partial Adaptive Index that uses IntersectScan on the index conditions:

    CREATE INDEX idx_adpt ON `travel-sample`(DISTINCT PAIRS(self)) WHERE city="Paris";
    CREATE INDEX idx_reg1 ON `travel-sample`(name) WHERE city="Paris";
    CREATE INDEX idx_reg2 ON `travel-sample`(city);
    
    SELECT * FROM `travel-sample` WHERE city="Paris" AND name IS NOT NULL;

    The above query requires only a regular index, so it uses index idx_reg1 and ignores index idx_reg2. When the adaptive index idx_adpt has only the clause city="Paris" and is used with the above query, then index idx_adpt will still use IntersectScan. Here, we have only a single adaptive index instead of a reduction in the number of indexes. To fix this, you may need to remove the index condition from the predicate while spanning generations.

Functional Limitations

It is important to understand that adaptive indexes are not a panacea and that they have trade-offs compared to traditional composite indexes:

  1. Adaptive Indexes are bound to the limitations of Array Indexes because they are built over Array Indexing technology. Index Joins can’t use Adaptive Indexes because Index Joins can’t use array indexes, and Adaptive Index is basically an array index.

  2. Indexed entries of the Adaptive Index are typically larger in size compared to the simple index on respective fields because the indexed items are elements of the PAIRS() array, which are basically name-value pairs of the document fields. So, it may be relatively slower when compared with equivalent simple index. For example, in the following equivalent queries, C8/Q8 may perform better than C9/Q9.

    This example uses the def_city index, which is installed with the travel-sample bucket.

    C8
    CREATE INDEX `def_city` ON `travel-sample`(`city`);
    Q8
    EXPLAIN SELECT city FROM `travel-sample` USE INDEX (def_city) WHERE city = "San Francisco";
    Result
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan3",
            "covers": [
              "cover ((`travel-sample`.`city`))",
              "cover ((meta(`travel-sample`).`id`))"
            ],
            "index": "def_city",
            "index_id": "d7c415d64edaa432",
            "index_projection": {
              "entry_keys": [
                0
              ]
            },
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "exact": true,
                "range": [
                  {
                    "high": "\"San Francisco\"",
                    "inclusion": 3,
                    "low": "\"San Francisco\""
                  }
                ]
              }
            ],
            "using": "gsi"
          },
    ...
    C9
    CREATE INDEX `ai_city` ON `travel-sample`(DISTINCT PAIRS({city}));
    Q9
    EXPLAIN SELECT city FROM `travel-sample` USE INDEX (ai_city) WHERE city = "San Francisco";
    Result
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan3",
              "index": "ai_city",
              "index_id": "4bc113bc5ab7ac3a",
              "index_projection": {
                "primary_key": true
              },
              "keyspace": "travel-sample",
              "namespace": "default",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "[\"city\", \"San Francisco\"]",
                      "inclusion": 3,
                      "low": "[\"city\", \"San Francisco\"]" (1)
                    }
                  ]
                }
              ],
              "using": "gsi"
            }
          },
    ...
    1 Note how the index key values are represented in the spans.
  3. Adaptive index requires more storage and memory, especially in case of Memory Optimized Indexes.

    1. The size of the index and the number of indexed items in an Adaptive Index grow rapidly with the number of fields in the documents, as well as, with the number of different values for various fields in the documents or keyspace.

    2. Moreover, if the documents have nested sub-objects, then the adaptive index will index the sub-documents and related fields at each level of nesting.

    3. Similarly, if the documents have array fields, then each of array elements are explored and indexed.

    4. For example, the following queries show that a single route document in travel-sample generates 103 index items and that all route documents produce ~2.3 million items.

      SELECT array_length(PAIRS(self)) FROM `travel-sample`
      WHERE type = "route" LIMIT 1;
      Result
      [
        {
          "$1": 103
        }
      ]
      SELECT sum(array_length(PAIRS(self))) FROM `travel-sample`
      WHERE type = "route" LIMIT 1;
      Result
      [
        {
          "$1": 2285464
        }
      ]

    So, the generic adaptive indexes (with SELF) should be employed carefully. Whenever applicable, it is recommended to use the following techniques to minimize the size and scope of the adaptive index:

    • Instead of SELF, use selective adaptive indexes by specifying the field names of interest to the PAIRS() function. For examples, refer to C4, Q1, Q2, and Q3 above.

    • Use partial adaptive indexes with a WHERE clause that will filter the number of documents that will be indexed. For examples, refer to C5, Q5, and Q5A above.

  4. A generic adaptive index (on SELF) will be qualified for all queries on the keyspace. So, when using with other GSI indexes, this will result in more IntersectScan operations for queries that qualify other non-adaptive indexes. This may impact query performance and overall load on query and indexer nodes. To alleviate the negative effects, you may want to specify the USE INDEX clause in SELECT queries whenever possible.

  5. Adaptive Indexes cannot be used as Covered Indexes for any queries. See example Q9 above.

  6. Adaptive Indexes can be created only on document field identifiers, not on functional expressions on the fields. For example, the following query uses a default index, such as def_city, instead of the specified adaptive index ai_city1:

    CREATE INDEX `ai_city1` ON `travel-sample`(DISTINCT PAIRS({"city" : LOWER(city)}));
    EXPLAIN SELECT city FROM `travel-sample`
    USE INDEX (ai_city1)
    WHERE LOWER(city) = "san francisco";
    Result
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan3",
            "covers": [
              "cover ((`travel-sample`.`city`))",
              "cover ((meta(`travel-sample`).`id`))"
            ],
            "index": "def_city", (1)
            "index_id": "d7c415d64edaa432",
    ...
    1 This does not use our specified ai_city1 index because it’s a functional index expression on the field city.
  7. Adaptive Indexes do not work with NOT LIKE predicates with a leading wildcard (see MB-23981). For example, the following query also uses a default index, such as def_city, instead of the specified adaptive index ai_city. However, it works fine for LIKE predicates with a leading wildcard.

    EXPLAIN SELECT city FROM `travel-sample`
    USE INDEX (ai_city)
    WHERE city NOT LIKE "%Francisco";
    Result
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan3",
            "covers": [
              "cover ((`travel-sample`.`city`))",
              "cover ((meta(`travel-sample`).`id`))"
            ],
            "index": "def_city", (1)
            "index_id": "d7c415d64edaa432",
    ...
    EXPLAIN SELECT city FROM `travel-sample`
    USE INDEX (ai_city)
    WHERE city LIKE "%Francisco";
    Result
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan3",
              "index": "ai_city", (2)
              "index_id": "4bc113bc5ab7ac3a",
    ...
    1 Doesn’t use ai_city with NOT LIKE and leading wildcard.
    2 Uses ai_city with LIKE and leading wildcard.
  8. Adaptive indexes can’t use Covered Scans. An adaptive index can’t be a covering index, as seen in the following example:

    CREATE INDEX `ai_city2` ON `travel-sample`(DISTINCT PAIRS({"city" : city}));
    EXPLAIN SELECT city FROM `travel-sample`
    WHERE city = "San Francisco"; (1)
    Result
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan3",
            "covers": [
              "cover ((`travel-sample`.`city`))",
              "cover ((meta(`travel-sample`).`id`))"
            ],
            "index": "def_city", (2)
            "index_id": "d7c415d64edaa432",
    ...
    1 No index specified in query.
    2 Doesn’t use ai_city2 as a covering index.