Default Analyzers

Capella Operational

March 23, 2025

Use an analyzer to filter and modify search strings to improve matches for search results.

Analyzers contain:

Character filters, which remove unwanted characters from search input.
Tokenizers, which separate input strings into individual tokens.
Token filters, which modify tokens.
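
The three stages listed above run in order: character filters first, then the tokenizer, then the token filters. The following Python sketch only illustrates that ordering. The names `analyze`, `strip_html`, `whitespace_tokenizer`, and `lowercase` are hypothetical helpers written for this example, not part of the Search Service or any Couchbase SDK.

```python
import re
from typing import Callable, List

# Type aliases for the three stages (illustrative only).
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Apply character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def strip_html(text: str) -> str:
    """Hypothetical character filter: replace HTML-like tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Hypothetical token filter: lowercase every token."""
    return [token.lower() for token in tokens]


tokens = analyze(
    "<b>Couchbase</b> Server",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase],
)
print(tokens)  # ['couchbase', 'server']
```

Swapping any one stage changes the resulting tokens without touching the other two, which is why an analyzer is described in terms of these separate components.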

When you create a type mapping, you can choose one of the default analyzers for it, or create your own.
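
A type mapping can either name an analyzer directly or use the `inherit` option described in the table on this page, in which case it takes the default analyzer set for the index. The Python sketch below models only that fallback rule. The function `effective_analyzer` and the type mapping names (`hotel`, `review_fr`, `sku`) are hypothetical and exist only for this illustration; this is not how the Search Service stores or resolves its settings internally.

```python
from typing import Dict

INHERIT = "inherit"


def effective_analyzer(type_mapping_analyzer: str, index_default: str) -> str:
    """Return the analyzer a type mapping ends up using.

    Assumption for this sketch: "inherit" means "fall back to the
    index's default analyzer"; any other value is used as-is.
    """
    if type_mapping_analyzer == INHERIT:
        return index_default
    return type_mapping_analyzer


# Hypothetical index default and type mappings.
index_default = "standard"
type_mappings: Dict[str, str] = {
    "hotel": "inherit",   # falls back to the index default
    "review_fr": "fr",    # French language analyzer
    "sku": "keyword",     # exact-match, single-token analyzer
}

resolved = {
    name: effective_analyzer(analyzer, index_default)
    for name, analyzer in type_mappings.items()
}
print(resolved)
# {'hotel': 'standard', 'review_fr': 'fr', 'sku': 'keyword'}
```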

The following default analyzer options are available:

| Analyzer | Description |
| --- | --- |
| inherit | If you set an analyzer to `inherit`, the Search index component inherits the default analyzer set for an index. |
| Arabic - ar | An Arabic language analyzer. |
| Chinese, Japanese, and Korean - cjk | An analyzer designed for the Chinese, Japanese, and Korean languages. |
| Kurdish - ckb | A Kurdish language analyzer. |
| Danish - da | A Danish language analyzer. |
| German - de | A German language analyzer. |
| English - en | An English language analyzer. |
| Castilian Spanish - es | A Castilian Spanish language analyzer. |
| Persian - fa | A Persian language analyzer. |
| Finnish - fi | A Finnish language analyzer. |
| French - fr | A French language analyzer. |
| Hebrew - he | A Hebrew language analyzer. |
| Hindi - hi | A Hindi language analyzer. |
| Croatian - hr | A Croatian language analyzer. |
| Hungarian - hu | A Hungarian language analyzer. |
| Italian - it | An Italian language analyzer. |
| keyword | The `keyword` analyzer turns input into a single token. It forces exact matches and preserves whitespace characters like spaces. For example, the `keyword` analyzer turns an input of `Couchbase Server` into a single token: `Couchbase Server`. |
| Dutch - nl | A Dutch language analyzer. |
| Norwegian - no | A Norwegian language analyzer. |
| Portuguese - pt | A Portuguese language analyzer. |
| Romanian - ro | A Romanian language analyzer. |
| Russian - ru | A Russian language analyzer. |
| simple | The `simple` analyzer turns input into tokens based on letter characters. It removes characters like punctuation and numbers, and uses these characters as the boundaries for tokens. For example, the `simple` analyzer turns an input of `Couchbase Server` into two tokens: `Couchbase` and `Server`. |
| standard | The `standard` analyzer uses the `unicode` tokenizer with the `to_lower` and `stop_en` token filters. For example, the `standard` analyzer turns an input of `The name is Couchbase Server` into three tokens: `name`, `couchbase`, and `server`. |
| Swedish - sv | A Swedish language analyzer. |
| Turkish - tr | A Turkish language analyzer. |
| web | The `web` analyzer finds email addresses, URLs, Twitter usernames, and hashtags in its input and turns them into tokens. For example, the `web` analyzer turns an input of `Send #Couchbase to example@gmail.com` into four tokens: `send`, `#Couchbase`, `to`, and `example@gmail.com`. |
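
To make the difference between the `keyword`, `simple`, and `standard` analyzers concrete, the Python sketch below approximates the behavior described in the table and reproduces its three examples. These functions are rough stand-ins written for illustration only, not the Search Service's implementation. In particular, the regular expressions only approximate the real tokenizers, and `STOP_WORDS_EN` is a tiny assumed subset standing in for the `stop_en` list.

```python
import re
from typing import List

# Assumption: a very small stand-in for the stop_en stop word list.
STOP_WORDS_EN = {"a", "an", "and", "is", "of", "the", "to"}


def keyword_analyzer(text: str) -> List[str]:
    """Whole input becomes one token; whitespace is preserved."""
    return [text]


def simple_analyzer(text: str) -> List[str]:
    """Keep runs of letters; digits and punctuation act as boundaries."""
    # [^\W\d_] matches any word character that is not a digit or underscore,
    # which leaves letters only.
    return re.findall(r"[^\W\d_]+", text)


def standard_analyzer(text: str) -> List[str]:
    """Split into words, lowercase them, then drop stop words."""
    words = re.findall(r"\w+", text)  # rough stand-in for the unicode tokenizer
    lowered = [word.lower() for word in words]  # stand-in for to_lower
    return [word for word in lowered if word not in STOP_WORDS_EN]  # stand-in for stop_en


print(keyword_analyzer("Couchbase Server"))
# ['Couchbase Server']
print(simple_analyzer("Couchbase Server"))
# ['Couchbase', 'Server']
print(standard_analyzer("The name is Couchbase Server"))
# ['name', 'couchbase', 'server']
```

Because `keyword` keeps the input as one token, a query must match the whole string exactly, whereas `standard` lets a query for `couchbase` match regardless of case or surrounding stop words.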
