You are viewing the documentation for a prerelease version.

View Latest

Cost-Based Optimizer

  • Developer Preview
The cost-based optimizer takes into account the cost of memory, CPU, network transport, and disk usage when choosing the optimal plan to execute a query.

This is a Developer Preview feature, intended for development purposes only. Do not use this feature in production. No Enterprise Support is provided for Developer Preview features.

Refer to Developer Preview Mode for more information.

Overview

The cost-based optimizer (CBO) replaces the previous rules-based N1QL optimizer. It can generate a query plan for SELECT, UPDATE, DELETE, MERGE, and INSERT INTO with SELECT queries.

The cost-based optimizer uses metadata and statistics to estimate amount of processing (memory, CPU, and I/O) for each operation. It compares the cost of alternative routes, and then selects the query-execution plan with the least cost.

Query execution flow
Figure 1. Query execution flow, showing the cost-based optimizer using statistics and metadata

Optimizer Statistics

The cost-based optimizer uses the following statistics. These statistics are gathered using the UPDATE STATISTICS statement.

For keyspaces:

  • The number of documents in the keyspace.

  • The average document size.

For indexes:

  • The number of items in the index.

  • The number of index pages.

  • The resident ratio.

  • The average item size.

  • The average page size.

For data:

Distribution Statistics

The cost-based optimizer can collect distribution statistics on predicate expressions. These predicate expressions may be fields, nested fields, array expressions, or any of the expressions supported as an index key.

The distribution statistics enable cost estimation for predicates like c1 = 100, c1 >= 20, or c1 < 150. They also enable cost estimates for join predicates such as t1.c1 = t2.c2, assuming distribution statistics exist for both t1.c1 and t2.c2.

Distribution Bins

The optimizer takes a sample of the values returned by the expression across the keyspace. These sample values are sorted into distribution bins by data type and value.

  1. Values with different data types are placed into separate distribution bins. (A field may contain values of several different data types across documents.)

  2. After being separated by data type, values are sorted further into separate bins depending on their value.

The distribution bins are of approximately equal size, except for the last distribution bin for each data type, which could be a partial bin.

The sample size can be specified when you use the UPDATE STATISTICS statement.

Resolution

The number of distribution bins is determined by the resolution.

The default resolution is 1.0, meaning each distribution bin contains 1% of the documents, and therefore 100 bins are required. The minimum resolution is 0.02 (5000 distribution bins) and the maximum is 5.0 (20 distribution bins). The cost-based optimizer calculates the bin size based on the resolution and the number of documents in the collection.

The resolution can be specified when you use the UPDATE STATISTICS statement.

Overflow Bins

For each distribution bin, the number of distinct values is calculated, as a fraction of the total number of documents.

If a particular value is highly duplicated and represents more than 25% of a distribution bin, it is removed from the distribution bin and placed in an overflow bin. MISSING, NULL, or boolean values are always placed in an overflow bin.

Boundary Bins

Each distribution bin has a maximum value, which acts as the minimum value for the next bin.

A boundary bin containing no values is created before the first distribution bin of each different data type. The boundary bin contains no values. This provides the minimum value for the first bin of each type.

Histogram

The boundary bins, distribution bins, and overflow bins for each data type are chained together in the default ascending collation order used for N1QL data types:

  • MISSING

  • NULL

  • FALSE

  • TRUE

  • number

  • string

  • array

  • object

  • binary (non-JSON)

This forms a histogram of statistics for the index-key expression across multiple data types.

Distribution bins
Figure 2. Distribution bins and boundary bins for integers, strings, and arrays

Operations

When a query is executed, the optimizer performs the following tasks:

  1. Rewrite the query if necessary, in the same manner as the previous rules-based optimizer.

  2. Use the distribution histogram and index statistics to estimate the selectivity of a predicate — that is, the number of documents that the optimizer expects to retrieve which satisfy this predicate.

  3. Use the selectivity to estimate the cardinality — that is, the number of documents remaining after all applicable predicates are applied.

  4. Use the cardinality to estimate the cost of different access paths.

  5. Compare the costs and generate a query execution plan with the lowest cost.

The cost-based optimizer does not yet rewrite the query to use the optimal join ordering or join type.