Create a Custom Token Filter
Create a custom token filter with the Couchbase Capella UI to change how the Search Service creates tokens from Search index content and Search queries.
Token filters can improve your search results by removing characters from your Search index or Search queries that prevent matches.
Prerequisites
-
You have the Search Service enabled on a node in your operational cluster. For more information about how to change Services on your operational cluster, see Modify a Paid Cluster.
-
You have logged in to the Couchbase Capella UI.
-
You have started to create or already created an index in Advanced Mode Editing.
-
You have already created or started to create a custom analyzer in your Search index.
Procedure
To create a custom token filter with the Capella UI in Advanced Mode:
-
On the Operational Clusters page, select the operational cluster where you want to work with the Search Service.
-
Go to
. -
Click the name of the index where you want to create a custom token filter.
-
Make sure to select Enable Advanced Options.
-
Expand Global Index Settings.
-
Do one of the following:
-
To create a new custom analyzer with a new token filter, click Add Custom Analyzer.
-
To add a new custom token filter to use with an existing analyzer, expand the Default Analyzer list, and next to your custom analyzer, click Edit.
-
Click Add Custom Token Filter.
-
In the Token Filter Name field, enter a name for the token filter.
-
In the Type list, select a token filter type.
You can create any of the following custom token filters:
Token Filter Type | Description |
---|---|
dict_compound | Use a word list to find and create tokens from compound words in existing tokens. |
edge_ngram | Use a set character length to create tokens from the start or end of existing tokens. |
elision | Use a word list to remove elisions from input tokens. |
keyword_marker | Use a word list of keywords to find and create new tokens. |
length | Use a set character length to filter out tokens that are too long or too short. |
ngram | Use a set character length to create new tokens. |
normalize_unicode | Use Unicode Normalization to convert tokens. |
shingle | Use a set character length and separator to concatenate and create new tokens. |
stop_tokens | Use a word list to find and remove words from tokens. |
truncate_token | Use a set character length to truncate existing tokens. |
Create a Custom dict_compound Token Filter
A `dict_compound` token filter uses a word list to find subwords inside an input token.
If the token filter finds a subword inside a compound word, it turns the subword into a separate token.
For example, if you had a word list that contained `play` and `jump`, the token filter converts `playful jumping` into two tokens: `play` and `jump`.
To create a new `dict_compound` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select dict_compound.
-
In the Sub Words list, do one of the following to configure how the token filter finds subwords to create new tokens:
-
Choose an available default word list.
-
Click Create Custom Word List.
-
In the Word List Name field, enter a name for the new custom word list.
-
In the Add Words field, enter each word you want to add to your custom word list, separated by commas (`,`).
-
Click Create Custom Word List.
-
Click Add Custom Token Filter.
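The subword matching described in this section can be sketched in a few lines of Python. This is a simplified illustration and not the Search Service's implementation: the function name `dict_compound` is hypothetical, and the sketch assumes that only the matched subwords are emitted, as in the example above.

```python
def dict_compound(tokens, word_list):
    """Emit each word-list entry found inside an input token as its own token."""
    output = []
    for token in tokens:
        for word in word_list:
            # A word-list entry counts as a subword if it appears anywhere in the token.
            if word in token:
                output.append(word)
    return output


print(dict_compound(["playful", "jumping"], ["play", "jump"]))  # ['play', 'jump']
```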
Create a Custom edge_ngram Token Filter
An `edge_ngram` token filter uses a specified range of character lengths to create new tokens.
You can also choose whether to create the new tokens from the start of the input token or backward from the end.
For example, if you had a minimum of four and a maximum of five with an input token of `breweries`, the token filter creates the tokens `brew` and `brewe`.
To create a new `edge_ngram` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select edge_ngram.
-
Do one of the following:
-
To create new tokens starting from the end of input tokens, select Back.
-
To create new tokens starting from the beginning of input tokens, clear Back.
-
In the Min box, enter the minimum character length for a new token.
-
In the Max box, enter the maximum character length for a new token.
-
Click Add Custom Token Filter.
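As a rough illustration of how the Back, Min, and Max settings interact, the following Python sketch builds edge n-grams from one token. The function name `edge_ngram` and its parameter names are hypothetical; the sketch only mirrors the behavior described in this section.

```python
def edge_ngram(token, min_len, max_len, back=False):
    """Return the edge n-grams of a token with lengths from min_len to max_len."""
    grams = []
    # Never ask for more characters than the token contains.
    for size in range(min_len, min(max_len, len(token)) + 1):
        grams.append(token[-size:] if back else token[:size])
    return grams


print(edge_ngram("breweries", 4, 5))             # ['brew', 'brewe']
print(edge_ngram("breweries", 4, 5, back=True))  # ['ries', 'eries']
```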
Create a Custom elision Token Filter
An `elision` token filter removes elisions from input tokens.
For example, if you had the `stop_fr` word list in an `elision` token filter, the token `je m’appelle John` becomes the tokens `je`, `appelle`, and `John`.
To create a new `elision` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select elision.
-
In the Articles list, do one of the following to choose how to find elisions in input tokens:
-
Choose an available default word list.
-
Click Create Custom Word List.
-
In the Word List Name field, enter a name for the new custom word list.
-
In the Add Words field, enter each word you want to add to your custom word list, separated by commas (`,`).
-
Click Create Custom Word List.
-
Click Add Custom Token Filter.
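The following Python sketch shows one way elision removal could work on tokens that are already split on whitespace. It is an approximation for illustration only: the function name `elision` and the small article set are hypothetical and are not the contents of any default word list.

```python
def elision(tokens, articles):
    """Strip a leading article and apostrophe (for example, m') from each token."""
    output = []
    for token in tokens:
        # Treat the typographic apostrophe the same as the ASCII one.
        head, sep, tail = token.replace("\u2019", "'").partition("'")
        if sep and head.lower() in articles:
            output.append(tail)
        else:
            output.append(token)
    return output


# Hypothetical article list used only for this example.
print(elision(["je", "m\u2019appelle", "John"], {"j", "l", "m"}))  # ['je', 'appelle', 'John']
```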
Create a Custom keyword_marker Token Filter
A `keyword_marker` token filter finds keywords in an input token and turns them into tokens.
For example, if you had a word list that contained the keyword `beer`, the token `beer and breweries` becomes the token `beer`.
To create a new `keyword_marker` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select keyword_marker.
-
In the Articles list, do one of the following to choose how to find keywords to create tokens:
-
Choose an available default word list.
-
Click Create Custom Word List.
-
In the Word List Name field, enter a name for the new custom word list.
-
In the Add Words field, enter each word you want to add to your custom word list, separated by commas (`,`).
-
Click Create Custom Word List.
-
Click Add Custom Token Filter.
Create a Custom length Token Filter
A `length` token filter removes tokens that are shorter or longer than a set character length.
For example, if you had a range with a minimum of two and a maximum of four, the token `beer and breweries` becomes the tokens `beer` and `and`.
To create a new `length` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select length.
-
In the Min box, enter the minimum character length for a new token.
-
In the Max box, enter the maximum character length for a new token.
-
Click Add Custom Token Filter.
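The effect of the Min and Max settings can be sketched in Python as a simple range check. The function name `length_filter` is hypothetical, and the sketch assumes the input text is already split into tokens.

```python
def length_filter(tokens, min_len, max_len):
    """Keep only the tokens whose character length is within [min_len, max_len]."""
    return [token for token in tokens if min_len <= len(token) <= max_len]


print(length_filter(["beer", "and", "breweries"], 2, 4))  # ['beer', 'and']
```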
Create a Custom ngram Token Filter
An `ngram` token filter uses a specified character length to split an input token into new tokens.
For example, if you had a range with a minimum of four and a maximum of five, the token `beers` becomes the tokens `beer`, `beers`, and `eers`.
To create a new `ngram` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select ngram.
-
In the Min box, enter the minimum character length for a new token.
-
In the Max box, enter the maximum character length for a new token.
-
Click Add Custom Token Filter.
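To make the example above concrete, this Python sketch enumerates every n-gram of a token within the Min and Max range. The function name `ngram` is hypothetical, and the output order is an assumption chosen to match the example.

```python
def ngram(token, min_len, max_len):
    """Return every substring of token with a length from min_len to max_len."""
    grams = []
    for start in range(len(token)):
        for size in range(min_len, max_len + 1):
            # Skip n-grams that would run past the end of the token.
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(ngram("beers", 4, 5))  # ['beer', 'beers', 'eers']
```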
Create a Custom normalize_unicode Token Filter
A `normalize_unicode` token filter uses a specified Unicode Normalization form to create new tokens.
To create a new `normalize_unicode` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select normalize_unicode.
-
In the Form list, select the type of Unicode normalization to apply:
-
nfc: Use canonical decomposition and canonical composition to normalize characters. The token filter separates combined Unicode characters, then merges them into a single character.
-
nfd: Use canonical decomposition to normalize characters. The token filter separates combined Unicode characters.
-
nfkc: Use compatibility decomposition and canonical composition to normalize characters. The token filter removes variants and separates combined Unicode characters, then merges them into a single character.
-
nfkd: Use compatibility decomposition to normalize characters. The token filter removes variants and separates combined Unicode characters.
-
Click Add Custom Token Filter.
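The four normalization forms are defined by the Unicode standard, so their effect can be previewed with Python's standard `unicodedata` module. The wrapper name `normalize_unicode` is hypothetical; the sketch shows the forms themselves, not the Search Service's code.

```python
import unicodedata


def normalize_unicode(token, form):
    """Apply a Unicode normalization form (nfc, nfd, nfkc, or nfkd) to a token."""
    return unicodedata.normalize(form.upper(), token)


# "e" followed by a combining acute accent composes into a single character.
print(normalize_unicode("e\u0301", "nfc") == "\u00e9")  # True
# The precomposed character decomposes back into two code points.
print(normalize_unicode("\u00e9", "nfd") == "e\u0301")  # True
# Compatibility forms also replace variants such as the "fi" ligature.
print(normalize_unicode("\ufb01", "nfkc") == "fi")      # True
print(normalize_unicode("\ufb01", "nfkd") == "fi")      # True
```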
Create a Custom shingle Token Filter
A `shingle` token filter concatenates a specified number of adjacent input tokens, joined by a separator, to create new tokens.
For example, if you use a whitespace tokenizer, a range with a minimum of two and a maximum of three, a space as a separator, and include the original tokens, the text `abc def` becomes `abc`, `def`, and `abc def`.
To create a new `shingle` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select shingle.
-
In the Min box, enter the minimum number of tokens to concatenate into a new token.
-
In the Max box, enter the maximum number of tokens to concatenate into a new token.
-
Do one of the following:
-
To include the original token as an output token, select Include original token.
-
To remove the original token from output, clear Include original token.
-
(Optional) In the Separator field, enter a character or characters to add in between concatenated tokens.
-
(Optional) In the Filler field, enter a character or characters to replace tokens that are removed by another token filter.
-
Click Add Custom Token Filter.
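The following Python sketch approximates shingle creation. It assumes that Min and Max count the adjacent tokens joined into each shingle, which is the reading that reproduces the `abc def` example in this section. The function name `shingle` and its parameters are hypothetical, the Filler setting is left out, and the output order is chosen for readability.

```python
def shingle(tokens, min_size, max_size, include_original=True, separator=" "):
    """Join runs of min_size to max_size adjacent tokens into new tokens."""
    output = list(tokens) if include_original else []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            # Only build shingles that fit inside the token stream.
            if start + size <= len(tokens):
                output.append(separator.join(tokens[start:start + size]))
    return output


print(shingle(["abc", "def"], 2, 3))                          # ['abc', 'def', 'abc def']
print(shingle(["abc", "def"], 2, 3, include_original=False))  # ['abc def']
```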
Create a Custom stop_tokens Token Filter
A `stop_tokens` token filter uses a word list to remove specific tokens from input.
For example, if you have a word list that contains the word `and`, the token `beers and breweries` becomes `beers` and `breweries`.
To create a new `stop_tokens` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select stop_tokens.
-
In the Stop Words list, do one of the following to choose what word list the token filter should use to remove tokens:
-
Choose an available default word list.
-
Click Create Custom Word List.
-
In the Word List Name field, enter a name for the new custom word list.
-
In the Add Words field, enter each word you want to add to your custom word list, separated by commas (`,`).
-
Click Create Custom Word List.
-
Click Add Custom Token Filter.
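Stop word removal amounts to dropping any token that appears in the word list, as this short Python sketch shows. The function name `stop_tokens` is hypothetical, and the sketch assumes an exact, case-sensitive match on tokens that are already split.

```python
def stop_tokens(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in stop_words]


print(stop_tokens(["beers", "and", "breweries"], {"and"}))  # ['beers', 'breweries']
```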
Create a Custom truncate_token Token Filter
A `truncate_token` token filter uses a specified character length to shorten any input tokens that are too long.
For example, if you had a `length` of four, the token `beer and breweries` becomes `beer`, `and`, and `brew`.
To create a new `truncate_token` token filter with the Capella UI in Advanced Mode:
-
In the Type list, select truncate_token.
-
In the Length box, enter the maximum character length for an output token.
-
Click Add Custom Token Filter.
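Truncation can be sketched in Python as cutting each token to at most the configured number of characters. The function name `truncate_token` is hypothetical, and the sketch assumes the input text is already split into tokens.

```python
def truncate_token(tokens, length):
    """Shorten every token to at most `length` characters."""
    return [token[:length] for token in tokens]


# Tokens at or under the limit pass through unchanged.
print(truncate_token(["beer", "and", "breweries"], 4))
```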
Next Steps
After you create a custom token filter, you can use it with a custom analyzer.
To continue customizing your Search index, you can also:
To run a search and test the contents of your Search index, see Run A Simple Search with the Capella UI.