CostBased Optimizer
 Developer Preview
The costbased optimizer takes into account the cost of memory, CPU, network transport, and disk usage when choosing the optimal plan to execute a query.
This is a Developer Preview feature, intended for development purposes only. Do not use this feature in production. No Enterprise Support is provided for Developer Preview features. Refer to Developer Preview Mode for more information. 
Overview
The costbased optimizer (CBO) replaces the previous rulesbased N1QL optimizer. It can generate a query plan for SELECT, UPDATE, DELETE, MERGE, and INSERT INTO with SELECT queries.
The costbased optimizer uses metadata and statistics to estimate amount of processing (memory, CPU, and I/O) for each operation. It compares the cost of alternative routes, and then selects the queryexecution plan with the least cost.
Optimizer Statistics
The costbased optimizer uses the following statistics. These statistics are gathered using the UPDATE STATISTICS statement.
For keyspaces:

The number of documents in the keyspace.

The average document size.
For indexes:

The number of items in the index.

The number of index pages.

The resident ratio.

The average item size.

The average page size.
For data:

Distribution statistics — refer to the section below.
Distribution Statistics
The costbased optimizer can collect distribution statistics on predicate expressions. These predicate expressions may be fields, nested fields, array expressions, or any of the expressions supported as an index key.
The distribution statistics enable cost estimation for predicates like c1 = 100
, c1 >= 20
, or c1 < 150
.
They also enable cost estimates for join predicates such as t1.c1 = t2.c2
, assuming distribution statistics exist for both t1.c1
and t2.c2
.
Distribution Bins
The optimizer takes a sample of the values returned by the expression across the keyspace. These sample values are sorted into distribution bins by data type and value.

Values with different data types are placed into separate distribution bins. (A field may contain values of several different data types across documents.)

After being separated by data type, values are sorted further into separate bins depending on their value.
The distribution bins are of approximately equal size, except for the last distribution bin for each data type, which could be a partial bin.
The sample size can be specified when you use the UPDATE STATISTICS statement.
Resolution
The number of distribution bins is determined by the resolution.
The default resolution is 1.0
, meaning each distribution bin contains 1% of the documents, and therefore 100 bins are required.
The minimum resolution is 0.02
(5000 distribution bins) and the maximum is 5.0
(20 distribution bins).
The costbased optimizer calculates the bin size based on the resolution and the number of documents in the collection.
The resolution can be specified when you use the UPDATE STATISTICS statement.
Overflow Bins
For each distribution bin, the number of distinct values is calculated, as a fraction of the total number of documents.
If a particular value is highly duplicated and represents more than 25% of a distribution bin, it is removed from the distribution bin and placed in an overflow bin. MISSING, NULL, or boolean values are always placed in an overflow bin.
Boundary Bins
Each distribution bin has a maximum value, which acts as the minimum value for the next bin.
A boundary bin containing no values is created before the first distribution bin of each different data type. The boundary bin contains no values. This provides the minimum value for the first bin of each type.
Histogram
The boundary bins, distribution bins, and overflow bins for each data type are chained together in the default ascending collation order used for N1QL data types:

MISSING

NULL

FALSE

TRUE

number

string

array

object

binary (nonJSON)
This forms a histogram of statistics for the indexkey expression across multiple data types.
Operations
When a query is executed, the optimizer performs the following tasks:

Rewrite the query if necessary, in the same manner as the previous rulesbased optimizer.

Use the distribution histogram and index statistics to estimate the selectivity of a predicate — that is, the number of documents that the optimizer expects to retrieve which satisfy this predicate.

Use the selectivity to estimate the cardinality — that is, the number of documents remaining after all applicable predicates are applied.

Use the cardinality to estimate the cost of different access paths.

Compare the costs and generate a query execution plan with the lowest cost.
The costbased optimizer does not yet rewrite the query to use the optimal join ordering or join type.
Related Links

UPDATE STATISTICS statement

Blog post: Cost Based Optimizer for Couchbase N1QL