Default Token Filters
Use a token filter to filter a tokenizer’s results and get better search result matches.
The Search Service’s token filters work with tokenizers to filter search input tokens. Tokens can come from the content of your Search index or a Search query.
For more information about token filters, see Token Filters.
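Token filters are not used on their own: they are chained after a tokenizer inside an analyzer, and the analyzer is then assigned to one or more index fields. In a Couchbase deployment you normally define that chain in the Search index definition (through the UI or the REST API), but because the Search Service is built on the open source bleve text-analysis library, which uses the same filter names, the composition can be sketched directly in Go. The sketch below is illustrative only: the analyzer name `en_light`, the `unicode` tokenizer choice, the selected filters, and the `description` field are assumptions, not a required configuration.

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve/v2"
)

func main() {
	indexMapping := bleve.NewIndexMapping()

	// Register a custom analyzer that chains three of the token filters
	// listed in the table below. The name "en_light" and the filter
	// selection are illustrative, not a fixed recipe.
	err := indexMapping.AddCustomAnalyzer("en_light", map[string]interface{}{
		"type":      "custom",
		"tokenizer": "unicode",
		"token_filters": []interface{}{
			"to_lower",            // lowercase every token
			"stop_en",             // drop common English stop words
			"stemmer_en_snowball", // reduce each token to its Snowball stem
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Apply the analyzer to a hypothetical "description" field.
	descField := bleve.NewTextFieldMapping()
	descField.Analyzer = "en_light"
	indexMapping.DefaultMapping.AddFieldMappingsAt("description", descField)

	fmt.Println("analyzer and field mapping registered")
}
```

An end-to-end sketch that shows the combined effect of stop-word removal and stemming follows the table.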
The following token filters are available:
| Token Filter Type | Description |
| --- | --- |
| apostrophe | Removes all characters after an apostrophe (') from tokenizer results. Also removes the apostrophe. For example, a token like `doesn't` becomes `doesn`. |
| camelCase | Splits camelCase text inside a token into separate tokens. For example, the token filter splits a token like `camelCase` into `camel` and `Case`. |
| cjk_bigram | Converts Chinese, Japanese, and Korean tokenizer results into bigrams, or groups of two consecutive words. |
| cjk_width | Converts Chinese, Japanese, and Korean tokenizer results from full-width ASCII variants into Latin characters, and half-width katakana characters into their equivalent kana characters. |
| elision_ca | Removes all characters before an apostrophe from Catalan language tokenizer results. Also removes the apostrophe. |
| elision_fr | Removes all characters before an apostrophe from French language tokenizer results. Also removes the apostrophe. For example, the token filter converts a token like `l'avion` into `avion`. |
| elision_ga | Removes all characters before an apostrophe from Gaelic language tokenizer results. Also removes the apostrophe. |
| elision_it | Removes all characters before an apostrophe from Italian language tokenizer results. Also removes the apostrophe. |
| hr_suffix_transformation_filter | Replaces suffixes in Croatian tokenizer results with normalized suffixes. |
| lemmatizer_he | Lemmatizes similar forms of Hebrew words. Corrects spelling mistakes. |
| mark_he | Marks the Hebrew, non-Hebrew, and numeric tokens from tokenizer results. |
| niqqud_he | Forces niqqud-less spelling for Hebrew text in tokenizer results. |
| normalize_ar | Uses Unicode Normalization to normalize Arabic characters in tokens. |
| normalize_ckb | Uses Unicode Normalization to normalize Kurdish characters in tokens. |
| normalize_de | Uses Unicode Normalization to normalize German characters in tokens. |
| normalize_fa | Uses Unicode Normalization to normalize Persian characters in tokens. |
| normalize_hi | Uses Unicode Normalization to normalize Hindi characters in tokens. |
| normalize_in | Uses Unicode Normalization to normalize Indonesian characters in tokens. |
| possessive_en | Checks the second-last character in English-language tokenizer results for an apostrophe. If it finds an apostrophe, the token filter removes the last two characters from the token. |
| reverse | Reverses the tokens in tokenizer results. For example, the token filter converts a token like `abc` into `cba`. |
| stemmer_ar | Checks Arabic tokenizer results for suffixes and prefixes. If it finds any, the token filter removes them to leave the root word. |
| stemmer_ckb | Checks Kurdish tokenizer results for prefixes. If it finds a prefix, the token filter removes it to leave the root word. |
| stemmer_da_snowball | Uses the Snowball string processing language to convert Danish language tokenizer results into word stems. |
| stemmer_de_light | Uses light stemming to convert German language tokenizer results into word stems. Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem. Light stemming only removes frequently used prefixes and suffixes, and doesn't produce the root of a word, to preserve semantics. |
| stemmer_de_snowball | Uses the Snowball string processing language to convert German language tokenizer results into word stems. |
| stemmer_en_snowball | Uses the Snowball string processing language to convert English language tokenizer results into word stems. |
| stemmer_es_light | Uses light stemming to convert Spanish language tokenizer results into word stems. Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem. Light stemming only removes frequently used prefixes and suffixes, and doesn't produce the root of a word, to preserve semantics. |
| stemmer_es_snowball | Uses the Snowball string processing language to convert Castilian Spanish language tokenizer results into word stems. |
| stemmer_fi_snowball | Uses the Snowball string processing language to convert Finnish language tokenizer results into word stems. |
| stemmer_fr_light | Uses light stemming to convert French language tokenizer results into word stems. Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem. Light stemming only removes frequently used prefixes and suffixes, and doesn't produce the root of a word, to preserve semantics. |
| stemmer_fr_min | Uses minimal stemming to convert French language tokenizer results into word stems. Minimal stemming only removes the last character of a word or replaces some suffixes. |
| stemmer_fr_snowball | Uses the Snowball string processing language to convert French language tokenizer results into word stems. |
| stemmer_hi | Uses a lightweight stemmer for Hindi to remove suffixes from tokenizer results. |
| stemmer_hr | Uses an open source stemming rule set to find the root word in Croatian language tokenizer results. |
| stemmer_hu_snowball | Uses the Snowball string processing language to convert Hungarian language tokenizer results into word stems. |
| stemmer_it_light | Uses light stemming to convert Italian language tokenizer results into word stems. Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem. Light stemming only removes frequently used prefixes and suffixes, and doesn't produce the root of a word, to preserve semantics. |
| stemmer_it_snowball | Uses the Snowball string processing language to convert Italian language tokenizer results into word stems. |
| stemmer_nl_snowball | Uses the Snowball string processing language to convert Dutch language tokenizer results into word stems. |
| stemmer_no_snowball | Uses the Snowball string processing language to convert Norwegian language tokenizer results into word stems. |
| stemmer_porter | Transforms tokenizer results with the Porter stemming algorithm. For more information, see the official Porter Stemming Algorithm documentation. |
| stemmer_pt_light | Uses light stemming to convert Portuguese language tokenizer results into word stems. Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem. Light stemming only removes frequently used prefixes and suffixes, and doesn't produce the root of a word, to preserve semantics. |
| stemmer_ro_snowball | Uses the Snowball string processing language to convert Romanian language tokenizer results into word stems. |
| stemmer_ru_snowball | Uses the Snowball string processing language to convert Russian language tokenizer results into word stems. |
| stemmer_sv_snowball | Uses the Snowball string processing language to convert Swedish language tokenizer results into word stems. |
| stemmer_tr_snowball | Uses the Snowball string processing language to convert Turkish language tokenizer results into word stems. |
| stop_ar | Removes tokens from tokenizer results that are unnecessary for a search, based on an Arabic dictionary. |
| stop_bg | Removes tokens from tokenizer results that are unnecessary for a search, based on a Bulgarian dictionary. |
| stop_ca | Removes tokens from tokenizer results that are unnecessary for a search, based on a Catalan dictionary. |
| stop_ckb | Removes tokens from tokenizer results that are unnecessary for a search, based on a Kurdish dictionary. |
| stop_cs | Removes tokens from tokenizer results that are unnecessary for a search, based on a Czech dictionary. |
| stop_da | Removes tokens from tokenizer results that are unnecessary for a search, based on a Danish dictionary. |
| stop_de | Removes tokens from tokenizer results that are unnecessary for a search, based on a German dictionary. |
| stop_el | Removes tokens from tokenizer results that are unnecessary for a search, based on a Greek dictionary. |
| stop_en | Removes tokens from tokenizer results that are unnecessary for a search, based on an English dictionary. For example, the token filter removes tokens like `and`, `the`, and `is`. |
| stop_es | Removes tokens from tokenizer results that are unnecessary for a search, based on a Castilian Spanish dictionary. |
| stop_eu | Removes tokens from tokenizer results that are unnecessary for a search, based on a Basque dictionary. |
| stop_fa | Removes tokens from tokenizer results that are unnecessary for a search, based on a Persian dictionary. |
| stop_fi | Removes tokens from tokenizer results that are unnecessary for a search, based on a Finnish dictionary. |
| stop_fr | Removes tokens from tokenizer results that are unnecessary for a search, based on a French dictionary. |
| stop_ga | Removes tokens from tokenizer results that are unnecessary for a search, based on a Gaelic dictionary. |
| stop_gl | Removes tokens from tokenizer results that are unnecessary for a search, based on a Galician Spanish dictionary. |
| stop_he | Removes tokens from tokenizer results that are unnecessary for a search, based on a Hebrew dictionary. |
| stop_hi | Removes tokens from tokenizer results that are unnecessary for a search, based on a Hindi dictionary. |
| stop_hr | Removes tokens from tokenizer results that are unnecessary for a search, based on a Croatian dictionary. |
| stop_hu | Removes tokens from tokenizer results that are unnecessary for a search, based on a Hungarian dictionary. |
| stop_hy | Removes tokens from tokenizer results that are unnecessary for a search, based on an Armenian dictionary. |
| stop_id | Removes tokens from tokenizer results that are unnecessary for a search, based on an Indonesian dictionary. |
| stop_it | Removes tokens from tokenizer results that are unnecessary for a search, based on an Italian dictionary. |
| stop_nl | Removes tokens from tokenizer results that are unnecessary for a search, based on a Dutch dictionary. |
| stop_no | Removes tokens from tokenizer results that are unnecessary for a search, based on a Norwegian dictionary. |
| stop_pt | Removes tokens from tokenizer results that are unnecessary for a search, based on a Portuguese dictionary. |
| stop_ro | Removes tokens from tokenizer results that are unnecessary for a search, based on a Romanian dictionary. |
| stop_ru | Removes tokens from tokenizer results that are unnecessary for a search, based on a Russian dictionary. |
| stop_sv | Removes tokens from tokenizer results that are unnecessary for a search, based on a Swedish dictionary. |
| stop_tr | Removes tokens from tokenizer results that are unnecessary for a search, based on a Turkish dictionary. |
| to_lower | Converts all characters in tokens to lowercase. |
| unique | Removes any tokens that aren't unique. |
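The practical effect of chaining stop-word and stemmer filters is that different surface forms of the same word reduce to the same indexed terms. The following sketch, reusing the assumed `en_light` analyzer from the earlier example, indexes a document that contains "running" and matches it with a query for "runs", because both tokens stem to `run`, while `the` and `were` are dropped as stop words. The field name `summary`, the document content, and the in-memory index are illustrative assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve/v2"
)

func main() {
	// Same illustrative analyzer as in the earlier sketch.
	indexMapping := bleve.NewIndexMapping()
	if err := indexMapping.AddCustomAnalyzer("en_light", map[string]interface{}{
		"type":          "custom",
		"tokenizer":     "unicode",
		"token_filters": []interface{}{"to_lower", "stop_en", "stemmer_en_snowball"},
	}); err != nil {
		log.Fatal(err)
	}

	summary := bleve.NewTextFieldMapping()
	summary.Analyzer = "en_light"
	indexMapping.DefaultMapping.AddFieldMappingsAt("summary", summary)

	// In-memory index, for demonstration only.
	index, err := bleve.NewMemOnly(indexMapping)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// "The" and "were" are removed by stop_en; "running" is lowercased
	// and stemmed to "run" before it is indexed.
	if err := index.Index("doc1", map[string]interface{}{
		"summary": "The horses were running quickly",
	}); err != nil {
		log.Fatal(err)
	}

	// "runs" also stems to "run", so the document matches.
	query := bleve.NewMatchQuery("runs")
	query.SetField("summary")
	result, err := index.Search(bleve.NewSearchRequest(query))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("matching documents:", result.Total) // expected: 1
}
```

In a Couchbase cluster the same filter chain would be declared in the Search index definition rather than in application code; the sketch only illustrates how the filter names in the table compose.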