
Default Token Filters

Use a token filter to filter a tokenizer’s results and get better search result matches.

The Search Service’s token filters work with tokenizers to filter search input tokens. Tokens can come from the content of your Search index or a Search query.
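
To illustrate the idea, the following Python snippet chains two hypothetical stand-ins for the to_lower and stop_en filters described below over a tokenizer's output. This is a sketch of the concept only, not the Search Service's internal implementation:

    # Illustrative sketch only: hypothetical stand-ins for token filters,
    # not the Search Service's actual implementation.

    def to_lower(tokens):
        # Lowercase every token (mirrors the to_lower filter).
        return [t.lower() for t in tokens]

    def stop_en(tokens, stopwords=frozenset({"and", "is", "the"})):
        # Drop common English stop words (mirrors the stop_en filter).
        return [t for t in tokens if t not in stopwords]

    # Tokens as a tokenizer might produce them from index content or a query.
    tokens = ["The", "Search", "Service", "is", "fast"]
    for token_filter in (to_lower, stop_en):
        tokens = token_filter(tokens)
    print(tokens)  # ['search', 'service', 'fast']

Each filter consumes the token stream produced by the previous stage, so the order in which filters run can change the final tokens. In this example, lowercasing must happen before the stop word check so that "The" matches the stop word "the".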

For more information about token filters, see Token Filters.

The following token filters are available:

apostrophe

Removes all characters after an apostrophe (') from tokenizer results. Also removes the apostrophe. For example, the token Couchbase’s becomes Couchbase.

camelCase

Splits camelCase text inside a token into separate tokens.

For example, the token filter splits the token camelCaseText into camel, Case, and Text.

cjk_bigram

Converts Chinese, Japanese, and Korean tokenizer results into bigrams, or groups of two consecutive words.

cjk_width

Converts full-width ASCII variants in Chinese, Japanese, and Korean tokenizer results into their Latin equivalents, and half-width katakana characters into their equivalent kana characters.

elision_ca

Removes all characters before an apostrophe from Catalan language tokenizer results. Also removes the apostrophe.

elision_fr

Removes all characters before an apostrophe from French language tokenizer results. Also removes the apostrophe.

For example, the token filter converts the token l’avion to avion.

elision_ga

Removes all characters before an apostrophe from Gaelic language tokenizer results. Also removes the apostrophe.

elision_it

Removes all characters before an apostrophe from Italian language tokenizer results. Also removes the apostrophe.

hr_suffix_transformation_filter

Replaces suffixes in Croatian tokenizer results with normalized suffixes.

lemmatizer_he

Lemmatizes similar forms of Hebrew words. Corrects spelling mistakes.

mark_he

Marks Hebrew, non-Hebrew, and numeric tokens in tokenizer results.

niqqud_he

Forces niqqud-less spelling for Hebrew text in tokenizer results.

normalize_ar

Uses Unicode Normalization to normalize Arabic characters in tokens.

normalize_ckb

Uses Unicode Normalization to normalize Kurdish characters in tokens.

normalize_de

Uses Unicode Normalization to normalize German characters in tokens.

normalize_fa

Uses Unicode Normalization to normalize Persian characters in tokens.

normalize_hi

Uses Unicode Normalization to normalize Hindi characters in tokens.

normalize_in

Uses Unicode Normalization to normalize Indic characters in tokens.

possessive_en

Checks the second-to-last character in English language tokenizer results for an apostrophe. If it finds an apostrophe, the token filter removes the last two characters from the token.

reverse

Reverses the characters in each token from tokenizer results. For example, the token filter converts the token acrobat to taborca.

stemmer_ar

Checks Arabic tokenizer results for suffixes and prefixes. If it finds any, the token filter removes them to leave the root word.

stemmer_ckb

Checks Kurdish tokenizer results for prefixes. If it finds a prefix, the token filter removes it to leave the root word.

stemmer_da_snowball

Uses the Snowball string processing language to convert Danish language tokenizer results into word stems.

stemmer_de_light

Uses light stemming to convert German language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_de_snowball

Uses the Snowball string processing language to convert German language tokenizer results into word stems.

stemmer_en_snowball

Uses the Snowball string processing language to convert English language tokenizer results into word stems.

stemmer_es_light

Uses light stemming to convert Spanish language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_es_snowball

Uses the Snowball string processing language to convert Castilian Spanish language tokenizer results into word stems.

stemmer_fi_snowball

Uses the Snowball string processing language to convert Finnish language tokenizer results into word stems.

stemmer_fr_light

Uses light stemming to convert French language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_fr_min

Uses minimal stemming to convert French language tokenizer results into word stems.

Minimal stemming only removes the last character of a word or replaces some suffixes. For example, the stemmer_fr_min token filter removes x, s, r, e, and é characters from the end of words and replaces the aux suffix with al (see the sketch after this table).

stemmer_fr_snowball

Uses the Snowball string processing language to convert French language tokenizer results into word stems.

stemmer_hi

Uses a lightweight stemmer for Hindi to remove suffixes from tokenizer results.

stemmer_hr

Uses an open source stemming rule set to find the root word in Croatian language tokenizer results.

stemmer_hu_snowball

Uses the Snowball string processing language to convert Hungarian language tokenizer results into word stems.

stemmer_it_light

Uses light stemming to convert Italian language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_it_snowball

Uses the Snowball string processing language to convert Italian language tokenizer results into word stems.

stemmer_nl_snowball

Uses the Snowball string processing language to convert Dutch language tokenizer results into word stems.

stemmer_no_snowball

Uses the Snowball string processing language to convert Norwegian language tokenizer results into word stems.

stemmer_porter

Transforms tokenizer results with the Porter stemming algorithm. For more information, see the official Porter Stemming Algorithm documentation.

stemmer_pt_light

Uses light stemming to convert Portuguese language tokenizer results into word stems.

Regular stemming can change the semantic meaning of a word, because several words with different meanings might share the same root stem.

Light stemming removes only frequently used prefixes and suffixes, and does not reduce a word to its root, which preserves its meaning.

stemmer_ro_snowball

Uses the Snowball string processing language to convert Romanian language tokenizer results into word stems.

stemmer_ru_snowball

Uses the Snowball string processing language to convert Russian language tokenizer results into word stems.

stemmer_sv_snowball

Uses the Snowball string processing language to convert Swedish language tokenizer results into word stems.

stemmer_tr_snowball

Uses the Snowball string processing language to convert Turkish language tokenizer results into word stems.

stop_ar

Removes tokens from tokenizer results that are unnecessary for a search, based on an Arabic dictionary.

stop_bg

Removes tokens from tokenizer results that are unnecessary for a search, based on a Bulgarian dictionary.

stop_ca

Removes tokens from tokenizer results that are unnecessary for a search, based on a Catalan dictionary.

stop_ckb

Removes tokens from tokenizer results that are unnecessary for a search, based on a Kurdish dictionary.

stop_cs

Removes tokens from tokenizer results that are unnecessary for a search, based on a Czech dictionary.

stop_da

Removes tokens from tokenizer results that are unnecessary for a search, based on a Danish dictionary.

stop_de

Removes tokens from tokenizer results that are unnecessary for a search, based on a German dictionary.

stop_el

Removes tokens from tokenizer results that are unnecessary for a search, based on a Greek dictionary.

stop_en

Removes tokens from tokenizer results that are unnecessary for a search, based on an English dictionary. For example, the token filter removes and, is, and the from tokenizer results.

stop_es

Removes tokens from tokenizer results that are unnecessary for a search, based on a Castilian Spanish dictionary.

stop_eu

Removes tokens from tokenizer results that are unnecessary for a search, based on a Basque dictionary.

stop_fa

Removes tokens from tokenizer results that are unnecessary for a search, based on a Persian dictionary.

stop_fi

Removes tokens from tokenizer results that are unnecessary for a search, based on a Finnish dictionary.

stop_fr

Removes tokens from tokenizer results that are unnecessary for a search, based on a French dictionary.

stop_ga

Removes tokens from tokenizer results that are unnecessary for a search, based on a Gaelic dictionary.

stop_gl

Removes tokens from tokenizer results that are unnecessary for a search, based on a Galician Spanish dictionary.

stop_he

Removes tokens from tokenizer results that are unnecessary for a search, based on a Hebrew dictionary.

stop_hi

Removes tokens from tokenizer results that are unnecessary for a search, based on a Hindi dictionary.

stop_hr

Removes tokens from tokenizer results that are unnecessary for a search, based on a Croatian dictionary.

stop_hu

Removes tokens from tokenizer results that are unnecessary for a search, based on a Hungarian dictionary.

stop_hy

Removes tokens from tokenizer results that are unnecessary for a search, based on an Armenian dictionary.

stop_id

Removes tokens from tokenizer results that are unnecessary for a search, based on an Indonesian dictionary.

stop_it

Removes tokens from tokenizer results that are unnecessary for a search, based on an Italian dictionary.

stop_nl

Removes tokens from tokenizer results that are unnecessary for a search, based on a Dutch dictionary.

stop_no

Removes tokens from tokenizer results that are unnecessary for a search, based on a Norwegian dictionary.

stop_pt

Removes tokens from tokenizer results that are unnecessary for a search, based on a Portuguese dictionary.

stop_ro

Removes tokens from tokenizer results that are unnecessary for a search, based on a Romanian dictionary.

stop_ru

Removes tokens from tokenizer results that are unnecessary for a search, based on a Russian dictionary.

stop_sv

Removes tokens from tokenizer results that are unnecessary for a search, based on a Swedish dictionary.

stop_tr

Removes tokens from tokenizer results that are unnecessary for a search, based on a Turkish dictionary.

to_lower

Converts all characters in tokens to lowercase.

unique

Removes duplicate tokens from tokenizer results, keeping only the first occurrence of each token (see the sketch below).
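
A few of the filters above are specified precisely enough in their descriptions to sketch directly. The following Python functions are hedged re-implementations of the documented rules for the apostrophe, possessive_en, stemmer_fr_min, reverse, and unique filters, for illustration only; the real filters may handle edge cases differently:

    import re

    def apostrophe(token):
        # Keep only the text before the first apostrophe, dropping the
        # apostrophe itself (mirrors the apostrophe filter).
        return token.split("'", 1)[0]

    def possessive_en(token):
        # If the second-to-last character is an apostrophe, remove the last
        # two characters (mirrors the possessive_en filter).
        if len(token) >= 2 and token[-2] == "'":
            return token[:-2]
        return token

    def stemmer_fr_min(token):
        # Documented rules: replace the "aux" suffix with "al", otherwise
        # strip a single trailing x, s, r, e, or é. Checking "aux" first is
        # an assumption about rule order.
        if token.endswith("aux"):
            return token[:-3] + "al"
        return re.sub(r"[xsreé]$", "", token)

    def reverse(token):
        # Reverse the characters in the token (mirrors the reverse filter).
        return token[::-1]

    def unique(tokens):
        # Keep only the first occurrence of each token (mirrors unique).
        seen = set()
        return [t for t in tokens if not (t in seen or seen.add(t))]

    print(apostrophe("Couchbase's"))   # Couchbase
    print(possessive_en("John's"))     # John
    print(stemmer_fr_min("chevaux"))   # cheval
    print(reverse("acrobat"))          # taborca
    print(unique(["a", "b", "a"]))     # ['a', 'b']

The input "John's" is a hypothetical example; the possessive_en output follows the documented rule of dropping the apostrophe and the character after it.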