- Couchbase Server 7.0
The cost-based optimizer takes into account the cost of memory, CPU, network transport, and disk usage when choosing the optimal plan to execute a query.
The cost-based optimizer (CBO) is a built-in feature that enables the Query service to create the most efficient plan to execute a query.
The cost-based optimizer uses metadata and statistics to estimate the amount of processing (memory, CPU, and I/O) required for each operation. It compares the cost of alternative routes, and then selects the query-execution plan with the least cost.
Before you can use the cost-based optimizer with a query, you must first gather the statistics that it needs using the UPDATE STATISTICS statement. The cost-based optimizer uses the following statistics.
The number of documents in the keyspace.
The average document size.
The number of items in the index.
The number of index pages.
The resident ratio.
The average item size.
The average page size.
Distribution statistics — refer to the section below.
The cost-based optimizer can collect distribution statistics on predicate expressions. These predicate expressions may be fields, nested fields, array expressions, or any of the expressions supported as an index key.
The distribution statistics enable the optimizer to estimate the cost for predicates like
c1 = 100,
c1 >= 20, or
c1 < 150.
They also enable cost estimates for join predicates such as
t1.c1 = t2.c2, assuming distribution statistics exist for both
The optimizer takes a sample of the values returned by the expression across the keyspace. These sample values are sorted into distribution bins by data type and value.
Values with different data types are placed into separate distribution bins. (A field may contain values of several different data types across documents.)
After being separated by data type, values are sorted further into separate bins depending on their value.
The distribution bins are of approximately equal size, except for the last distribution bin for each data type, which could be a partial bin.
The sample size can be specified when you use the UPDATE STATISTICS statement.
The number of distribution bins is determined by the resolution.
The default resolution is
1.0, meaning each distribution bin contains 1% of the documents, and therefore 100 bins are required.
The minimum resolution is
0.02 (5000 distribution bins) and the maximum is
5.0 (20 distribution bins).
The cost-based optimizer calculates the bin size based on the resolution and the number of documents in the collection.
The resolution can be specified when you use the UPDATE STATISTICS statement.
For each distribution bin, the number of distinct values is calculated, as a fraction of the total number of documents.
If a particular value is highly duplicated and represents more than 25% of a distribution bin, it is removed from the distribution bin and placed in an overflow bin. MISSING, NULL, or boolean values are always placed in an overflow bin.
Each distribution bin has a maximum value, which acts as the minimum value for the next bin.
A boundary bin containing no values is created before the first distribution bin of each different data type. The boundary bin contains no values. This provides the minimum value for the first bin of each type.
The boundary bins, distribution bins, and overflow bins for each data type are chained together in the default ascending collation order used for N1QL data types:
This forms a histogram of statistics for the index-key expression across multiple data types.
The cost-based optimizer is active by default. You can activate or deactivate it as required.
use_cboparameter specifies whether the cost-based optimizer is enabled per request. If a request does not include this parameter, the node-level setting is used.
use-cbosetting specifies whether the cost-based optimizer is enabled for a single query node. It defaults to
queryUseCBOsetting enables you to specify the node-level setting for all the nodes in the cluster.
You can also activate or deactivate the cost-based optimizer using the Query Settings in the Couchbase Web Console.
If the cost-based optimizer is not active, the Query service falls back on the rules-based N1QL optimizer.
When active, the optimizer performs the following tasks when a query is executed:
Rewrite the query if necessary, in the same manner as the previous rules-based optimizer.
Use the distribution histogram and index statistics to estimate the selectivity of a predicate — that is, the number of documents that the optimizer expects to retrieve which satisfy this predicate.
Use the selectivity to estimate the cardinality — that is, the number of documents remaining after all applicable predicates are applied.
Use the cardinality to estimate the cost of different access paths.
Compare the costs and generate a query execution plan with the lowest cost.
The cost-based optimizer can choose the optimal join type for each join, but does not yet rewrite the query to use the optimal join ordering.