
Default Token Filters

Use a token filter to filter a tokenizer’s results and get better search result matches.

The Search Service’s token filters work with tokenizers to filter search input tokens. Tokens can come from the content of your Search index or a Search query.
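
To illustrate the idea, the following Python snippet chains two hypothetical stand-ins for the to_lower and stop_en filters described below over a tokenizer's output. This is a sketch of the concept only, not the Search Service's internal implementation:

    # Illustrative sketch only: hypothetical stand-ins for token filters,
    # not the Search Service's actual implementation.

    def to_lower(tokens):
        # Lowercase every token (mirrors the to_lower filter).
        return [t.lower() for t in tokens]

    def stop_en(tokens, stopwords=frozenset({"and", "is", "the"})):
        # Drop common English stop words (mirrors the stop_en filter).
        return [t for t in tokens if t not in stopwords]

    # Tokens as a tokenizer might produce them from index content or a query.
    tokens = ["The", "Search", "Service", "is", "fast"]
    for token_filter in (to_lower, stop_en):
        tokens = token_filter(tokens)
    print(tokens)  # ['search', 'service', 'fast']

Each filter consumes the token stream produced by the previous stage, so the order in which filters run can change the final tokens. In this example, lowercasing must happen before the stop word check so that "The" matches the stop word "the".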

For more information about token filters, see Token Filters.

The following token filters are available:

apostrophe

Removes all characters after an apostrophe (') from tokenizer results. Also removes the apostrophe. For example, the token Couchbase’s becomes Couchbase.

camelCase

Splits camelCase text inside a token into separate tokens.

For example, the token filter splits the token camelCaseText into camel, Case, and Text.

cjk_bigram

Converts Chinese, Japanese, and Korean tokenizer results into bigrams, or groups of two consecutive words.

cjk_width

Converts full-width ASCII variants in Chinese, Japanese, and Korean tokenizer results into their Latin equivalents, and half-width katakana characters into their equivalent kana characters.

elision_ca

Removes all characters before an apostrophe from Catalan language tokenizer results. Also removes the apostrophe.

elision_fr

Removes all characters before an apostrophe from French language tokenizer results. Also removes the apostrophe.

For example, the token filter converts the token l’avion to avion.

elision_ga

Removes all characters before an apostrophe from Gaelic language tokenizer results. Also removes the apostrophe.

elision_it

Removes all characters before an apostrophe from Italian language tokenizer results. Also removes the apostrophe.

hr_suffix_transformation_filter

Replaces suffixes in Croatian tokenizer results with normalized suffixes.

lemmatizer_he

Lemmatizes similar forms of Hebrew words. Corrects spelling mistakes.

mark_he

Marks Hebrew, non-Hebrew, and numeric tokens in tokenizer results.

niqqud_he

Forces niqqud-less spelling for Hebrew text in tokenizer results.

normalize_ar

Uses Unicode Normalization to normalize Arabic characters in tokens.

normalize_ckb

Uses Unicode Normalization to normalize Kurdish characters in tokens.

normalize_de

Uses Unicode Normalization to normalize German characters in tokens.

normalize_fa

Uses Unicode Normalization to normalize Persian characters in tokens.

normalize_hi

Uses Unicode Normalization to normalize Hindi characters in tokens.

normalize_in

Uses Unicode Normalization to normalize Indic characters in tokens.

possessive_en

Checks the second-to-last character in English language tokenizer results for an apostrophe. If it finds an apostrophe, the token filter removes the last two characters from the token.

reverse

Reverses the characters in each token from tokenizer results. For example, the token filter converts the token acrobat to taborca.

stemmer_ar

Checks Arabic tokenizer results for suffixes and prefixes. If it finds any, the token filter removes them to leave the root word.

stemmer_ckb

Checks Kurdish tokenizer results for prefixes. If it finds a prefix, the token filter removes it to leave the root word.

stemmer_da_snowball

Uses the Snowball string processing language to convert Danish language tokenizer results into word stems.

stemmer_de_light

Uses light stemming to convert German language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_de_snowball

Uses the Snowball string processing language to convert German language tokenizer results into word stems.

stemmer_en_snowball

Uses the Snowball string processing language to convert English language tokenizer results into word stems.

stemmer_es_light

Uses light stemming to convert Spanish language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_es_snowball

Uses the Snowball string processing language to convert Castilian Spanish language tokenizer results into word stems.

stemmer_fi_snowball

Uses the Snowball string processing language to convert Finnish language tokenizer results into word stems.

stemmer_fr_light

Uses light stemming to convert French language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_fr_min

Uses minimal stemming to convert French language tokenizer results into word stems.

Minimal stemming only removes the last character of a word or replaces some suffixes. For example, the stemmer_fr_min token filter removes x, s, r, e, and é characters from the end of words and replaces the aux suffix with al (see the sketch after this table).

stemmer_fr_snowball

Uses the Snowball string processing language to convert French language tokenizer results into word stems.

stemmer_hi

Uses a lightweight stemmer for Hindi to remove suffixes from tokenizer results.

stemmer_hr

Uses an open source stemming rule set to find the root word in Croatian language tokenizer results.

stemmer_hu_snowball

Uses the Snowball string processing language to convert Hungarian language tokenizer results into word stems.

stemmer_it_light

Uses light stemming to convert Italian language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_it_snowball

Uses the Snowball string processing language to convert Italian language tokenizer results into word stems.

stemmer_nl_snowball

Uses the Snowball string processing language to convert Dutch language tokenizer results into word stems.

stemmer_no_snowball

Uses the Snowball string processing language to convert Norwegian language tokenizer results into word stems.

stemmer_porter

Transforms tokenizer results with the Porter stemming algorithm. For more information, see the official Porter Stemming Algorithm documentation.

stemmer_pt_light

Uses light stemming to convert Portuguese language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_ro_snowball

Uses the Snowball string processing language to convert Romanian language tokenizer results into word stems.

stemmer_ru_snowball

Uses the Snowball string processing language to convert Russian language tokenizer results into word stems.

stemmer_sv_snowball

Uses the Snowball string processing language to convert Swedish language tokenizer results into word stems.

stemmer_tr_snowball

Uses the Snowball string processing language to convert Turkish language tokenizer results into word stems.

stop_ar

Removes tokens from tokenizer results that are unnecessary for a search, based on an Arabic dictionary.

stop_bg

Removes tokens from tokenizer results that are unnecessary for a search, based on a Bulgarian dictionary.

stop_ca

Removes tokens from tokenizer results that are unnecessary for a search, based on a Catalan dictionary.

stop_ckb

Removes tokens from tokenizer results that are unnecessary for a search, based on a Kurdish dictionary.

stop_cs

Removes tokens from tokenizer results that are unnecessary for a search, based on a Czech dictionary.

stop_da

Removes tokens from tokenizer results that are unnecessary for a search, based on a Danish dictionary.

stop_de

Removes tokens from tokenizer results that are unnecessary for a search, based on a German dictionary.

stop_el

Removes tokens from tokenizer results that are unnecessary for a search, based on a Greek dictionary.

stop_en

Removes tokens from tokenizer results that are unnecessary for a search, based on an English dictionary. For example, the token filter removes and, is, and the from tokenizer results.

stop_es

Removes tokens from tokenizer results that are unnecessary for a search, based on a Castilian Spanish dictionary.

stop_eu

Removes tokens from tokenizer results that are unnecessary for a search, based on a Basque dictionary.

stop_fa

Removes tokens from tokenizer results that are unnecessary for a search, based on a Persian dictionary.

stop_fi

Removes tokens from tokenizer results that are unnecessary for a search, based on a Finnish dictionary.

stop_fr

Removes tokens from tokenizer results that are unnecessary for a search, based on a French dictionary.

stop_ga

Removes tokens from tokenizer results that are unnecessary for a search, based on a Gaelic dictionary.

stop_gl

Removes tokens from tokenizer results that are unnecessary for a search, based on a Galician Spanish dictionary.

stop_he

Removes tokens from tokenizer results that are unnecessary for a search, based on a Hebrew dictionary.

stop_hi

Removes tokens from tokenizer results that are unnecessary for a search, based on a Hindi dictionary.

stop_hr

Removes tokens from tokenizer results that are unnecessary for a search, based on a Croatian dictionary.

stop_hu

Removes tokens from tokenizer results that are unnecessary for a search, based on a Hungarian dictionary.

stop_hy

Removes tokens from tokenizer results that are unnecessary for a search, based on an Armenian dictionary.

stop_id

Removes tokens from tokenizer results that are unnecessary for a search, based on an Indonesian dictionary.

stop_it

Removes tokens from tokenizer results that are unnecessary for a search, based on an Italian dictionary.

stop_nl

Removes tokens from tokenizer results that are unnecessary for a search, based on a Dutch dictionary.

stop_no

Removes tokens from tokenizer results that are unnecessary for a search, based on a Norwegian dictionary.

stop_pt

Removes tokens from tokenizer results that are unnecessary for a search, based on a Portuguese dictionary.

stop_ro

Removes tokens from tokenizer results that are unnecessary for a search, based on a Romanian dictionary.

stop_ru

Removes tokens from tokenizer results that are unnecessary for a search, based on a Russian dictionary.

stop_sv

Removes tokens from tokenizer results that are unnecessary for a search, based on a Swedish dictionary.

stop_tr

Removes tokens from tokenizer results that are unnecessary for a search, based on a Turkish dictionary.

to_lower

Converts all characters in tokens to lowercase.

unique

Removes duplicate tokens from tokenizer results, keeping only the first occurrence of each token (see the sketch below).
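
A few of the filters above are specified precisely enough in their descriptions to sketch directly. The following Python functions are hedged re-implementations of the documented rules for the apostrophe, possessive_en, stemmer_fr_min, reverse, and unique filters, for illustration only; the real filters may handle edge cases differently:

    import re

    def apostrophe(token):
        # Keep only the text before the first apostrophe, dropping the
        # apostrophe itself (mirrors the apostrophe filter).
        return token.split("'", 1)[0]

    def possessive_en(token):
        # If the second-to-last character is an apostrophe, remove the last
        # two characters (mirrors the possessive_en filter).
        if len(token) >= 2 and token[-2] == "'":
            return token[:-2]
        return token

    def stemmer_fr_min(token):
        # Documented rules: replace the "aux" suffix with "al", otherwise
        # strip a single trailing x, s, r, e, or é. Checking "aux" first is
        # an assumption about rule order.
        if token.endswith("aux"):
            return token[:-3] + "al"
        return re.sub(r"[xsreé]$", "", token)

    def reverse(token):
        # Reverse the characters in the token (mirrors the reverse filter).
        return token[::-1]

    def unique(tokens):
        # Keep only the first occurrence of each token (mirrors unique).
        seen = set()
        return [t for t in tokens if not (t in seen or seen.add(t))]

    print(apostrophe("Couchbase's"))   # Couchbase
    print(possessive_en("John's"))     # John
    print(stemmer_fr_min("chevaux"))   # cheval
    print(reverse("acrobat"))          # taborca
    print(unique(["a", "b", "a"]))     # ['a', 'b']

The input "John's" is a hypothetical example; the possessive_en output follows the documented rule of dropping the apostrophe and the character after it.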