Create a Custom Token Filter

    Create a custom token filter with the Couchbase Server Web Console to change how the Search Service creates tokens from Search index content and Search queries.

    Token filters can improve your search results by removing characters from your Search index or Search queries that prevent matches.

    Prerequisites

    Procedure

    To create a custom token filter with the Couchbase Server Web Console:

    1. Go to Search.

    2. Click the Search index where you want to create a custom token filter.

    3. Click Edit.

    4. Expand Customize Index > Custom Filters.

    5. Click Add Token Filter.

    6. In the Name field, enter a name for the token filter.

    You can create any of the following custom token filters:

    Token Filter Type    Description

    dict_compound        Use a wordlist to find and create tokens from compound words in existing tokens.
    edge_ngram           Use a set character length to create tokens from the start or end of existing tokens.
    elision              Use a wordlist to remove elisions from input tokens.
    keyword_marker       Use a wordlist of keywords to find and create new tokens.
    length               Use a set character length to filter out tokens that are too long or too short.
    ngram                Use a set character length to create new tokens.
    normalize_unicode    Use Unicode Normalization to convert tokens.
    shingle              Use a set character length and separator to concatenate and create new tokens.
    stop_tokens          Use a wordlist to find and remove words from tokens.
    truncate_token       Use a set character length to truncate existing tokens.

    Create a Custom dict_compound Token Filter

    A dict_compound token filter uses a wordlist to find subwords inside an input token. If the token filter finds a subword inside a compound word, it turns it into a separate token.

    For example, if you had a wordlist that contained play and jump, the token filter converts playful jumping into two tokens: play and jump.

    To create a new dict_compound token filter with the Couchbase Server Web Console:

    1. In the Type field, select dict_compound.

    2. In the Sub Words list, select the wordlist to use to find subwords in input tokens.

      You can choose your own custom wordlist or a default wordlist. Each subword match creates a new token.

    3. Click Save.
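
The subword matching described above can be sketched in Python. This is only an illustration of the idea, not the Search Service implementation: the function name `dict_compound` is hypothetical, matching is a plain substring test, and the sketch does not model whether the original token is also kept.

```python
def dict_compound(tokens, wordlist):
    """Emit every wordlist entry that appears inside an input token."""
    out = []
    for token in tokens:
        for word in wordlist:
            if word in token:  # simple substring match (assumption)
                out.append(word)
    return out


print(dict_compound(["playful", "jumping"], ["play", "jump"]))
# ['play', 'jump']
```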

    Create a Custom edge_ngram Token Filter

    An edge_ngram token filter uses a specified range to create new tokens. You can also choose whether to create the new token from the start or backward from the end of the input token.

    For example, with a minimum of four, a maximum of five, and an input token of breweries, the token filter creates the tokens brew and brewe.

    To create a new edge_ngram token filter with the Couchbase Server Web Console:

    1. In the Type field, select edge_ngram.

    2. Do one of the following:

      1. To create new tokens starting from the end of input tokens, select Back.

      2. To create new tokens starting from the beginning of input tokens, clear Back.

    3. In the Min field, enter the minimum character length for a new token.

    4. In the Max field, enter the maximum character length for a new token.

    5. Click Save.
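
A minimal Python sketch of how the Back, Min, and Max settings interact. The function `edge_ngram` and its parameter names are hypothetical and only mirror the fields in the procedure above.

```python
def edge_ngram(token, min_len, max_len, back=False):
    """Build n-grams anchored at the start (or end, if back=True) of a token."""
    out = []
    for n in range(min_len, max_len + 1):
        if n > len(token):
            break  # the token is too short for longer n-grams
        out.append(token[-n:] if back else token[:n])
    return out


print(edge_ngram("breweries", 4, 5))             # ['brew', 'brewe']
print(edge_ngram("breweries", 4, 5, back=True))  # ['ries', 'eries']
```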

    Create a Custom elision Token Filter

    An elision token filter removes elisions from input tokens.

    For example, if you had the stop_fr wordlist in an elision token filter, the token je m’appelle John becomes the tokens je, appelle, and John.

    To create a new elision token filter with the Couchbase Server Web Console:

    1. In the Type field, select elision.

    2. In the Articles list, select a wordlist to use to find elisions in input tokens.

      You can choose your own custom wordlist or a default wordlist.

    3. Click Save.
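
The following Python sketch illustrates elision removal under stated assumptions: the function `elision` is hypothetical, the articles set `{"m", "l"}` is a made-up stand-in for a real wordlist, and input is assumed to be already split on whitespace.

```python
def elision(tokens, articles):
    """Strip a leading article and apostrophe, such as m' or l', from each token."""
    out = []
    for token in tokens:
        for apostrophe in ("'", "\u2019"):  # straight and typographic apostrophes
            prefix, sep, rest = token.partition(apostrophe)
            if sep and prefix.lower() in articles:
                token = rest
                break
        out.append(token)
    return out


print(elision(["je", "m\u2019appelle", "John"], {"m", "l"}))
# ['je', 'appelle', 'John']
```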

    Create a Custom keyword_marker Token Filter

    A keyword_marker token filter finds keywords in an input token and turns them into tokens.

    For example, if you had a wordlist that contained the keyword beer, the token beer and breweries becomes the token beer.

    To create a new keyword_marker token filter with the Couchbase Server Web Console:

    1. In the Type field, select keyword_marker.

    2. In the Keywords list, select a wordlist to use to find keywords to create tokens.

      You can choose your own custom wordlist or a default wordlist.

    3. Click Save.

    Create a Custom length Token Filter

    A length token filter removes tokens that are shorter or longer than a set character length.

    For example, if you had a range with a minimum of two and a maximum of four, the token beer and breweries becomes the tokens beer and and.

    To create a new length token filter with the Couchbase Server Web Console:

    1. In the Type field, select length.

    2. In the Min field, enter the minimum character length a token must have to stay in the output.

    3. In the Max field, enter the maximum character length a token can have to stay in the output.

    4. Click Save.
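
As a rough illustration of the Min and Max settings, the sketch below keeps only tokens whose length falls inside the range. The name `length_filter` is hypothetical, and the bounds are assumed to be inclusive, which matches the example above.

```python
def length_filter(tokens, min_len, max_len):
    """Keep tokens whose character length is within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(length_filter(["beer", "and", "breweries"], 2, 4))
# ['beer', 'and']
```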

    Create a Custom ngram Token Filter

    An ngram token filter uses a specified character length to split an input token into new tokens.

    For example, if you had a range with a minimum of four and a maximum of five, the token beers becomes the tokens beer, beers, and eers.

    To create a new ngram token filter with the Couchbase Server Web Console:

    1. In the Type field, select ngram.

    2. In the Min field, enter the minimum character length for a new token.

    3. In the Max field, enter the maximum character length for a new token.

    4. Click Save.
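
One way to picture the ngram settings is a sliding window over the token, sketched here in Python. The function name `ngram` is hypothetical, and the output order is an assumption of this sketch.

```python
def ngram(token, min_len, max_len):
    """Return every substring of token whose length is between min_len and max_len."""
    out = []
    for start in range(len(token)):
        for n in range(min_len, max_len + 1):
            if start + n > len(token):
                break  # the window would run past the end of the token
            out.append(token[start:start + n])
    return out


print(ngram("beers", 4, 5))
# ['beer', 'beers', 'eers']
```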

    Create a Custom normalize_unicode Token Filter

    A normalize_unicode token filter uses a specified Unicode Normalization form to create new tokens.

    To create a new normalize_unicode token filter with the Couchbase Server Web Console:

    1. In the Type field, select normalize_unicode.

    2. In the Form list, select the type of Unicode normalization to apply:

      • nfc: Use canonical decomposition followed by canonical composition to normalize characters. The token filter separates combined Unicode characters, then merges them back into single characters where possible.

      • nfd: Use canonical decomposition to normalize characters. The token filter separates combined Unicode characters.

      • nfkc: Use compatibility decomposition followed by canonical composition to normalize characters. The token filter removes variants, then merges separated Unicode characters into single characters where possible.

      • nfkd: Use compatibility decomposition to normalize characters. The token filter converts Unicode characters to remove variants and separates combined characters.

    3. Click Save.
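
To see what the four forms do to real strings, the sketch below uses `unicodedata.normalize` from the Python standard library. The wrapper `normalize_unicode` is a hypothetical helper; only the form names correspond to the options in the Form list.

```python
import unicodedata


def normalize_unicode(tokens, form):
    """Apply one Unicode normalization form (nfc, nfd, nfkc, nfkd) to each token."""
    return [unicodedata.normalize(form.upper(), token) for token in tokens]


composed = "caf\u00e9"     # e with acute accent as a single code point
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent
ligature = "\ufb01ne"      # begins with the single-character "fi" ligature

print(normalize_unicode([decomposed], "nfc") == [composed])   # True
print(normalize_unicode([composed], "nfd") == [decomposed])   # True
print(normalize_unicode([ligature], "nfkc"))                  # ['fine']
print(normalize_unicode([ligature], "nfc") == [ligature])     # True
```

The canonical forms leave the ligature alone, while the compatibility forms replace it with the plain letters f and i.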

    Create a Custom shingle Token Filter

    A shingle token filter uses a specified character length and separator to create new tokens.

    For example, if you use a whitespace tokenizer, a range with a minimum of two and a maximum of three, a space as a separator, and select Include original token, the text abc def becomes the tokens abc, def, and abc def.

    To create a new shingle token filter with the Couchbase Server Web Console:

    1. In the Type field, select shingle.

    2. In the Min field, enter the minimum character length for a new token before concatenation.

    3. In the Max field, enter the maximum character length for a new token before concatenation.

    4. Do one of the following:

      1. To include the original token as an output token, select Include original token.

      2. To remove the original token from output, clear Include original token.

    5. (Optional) In the Separator field, enter a character or characters to add in between concatenated tokens.

    6. (Optional) In the Filler field, enter a character or characters to replace tokens that are removed by another token filter.

    7. Click Save.
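
The sketch below reproduces the abc def example in Python. It rests on an assumption that goes beyond the text: Min and Max are treated as the number of adjacent tokens joined into each shingle, which is the reading that yields the example output. The function `shingle` is hypothetical, the Filler setting is not modeled, and the output order is not meant to match the Search Service.

```python
def shingle(tokens, min_size, max_size, include_original=True, separator=" "):
    """Join runs of adjacent tokens into shingles of min_size..max_size tokens."""
    out = []
    for i, token in enumerate(tokens):
        if include_original:
            out.append(token)
        for size in range(min_size, max_size + 1):
            if i + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            out.append(separator.join(tokens[i:i + size]))
    return out


print(sorted(shingle(["abc", "def"], 2, 3)))
# ['abc', 'abc def', 'def']
```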

    Create a Custom stop_tokens Token Filter

    A stop_tokens token filter uses a wordlist to remove specific tokens from input.

    For example, if you have a wordlist that contains the word and, the token filter removes and from beers and breweries, leaving the tokens beers and breweries.

    To create a new stop_tokens token filter with the Couchbase Server Web Console:

    1. In the Type field, select stop_tokens.

    2. In the Stop Words list, select a wordlist to use to remove tokens.

      You can choose your own custom wordlist or a default wordlist.

    3. Click Save.
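
Stop-word removal amounts to a membership test against the wordlist, as this short Python sketch shows. The function name `stop_tokens` is hypothetical, and the comparison is assumed to be an exact, case-sensitive match.

```python
def stop_tokens(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    stop_set = set(stop_words)
    return [t for t in tokens if t not in stop_set]


print(stop_tokens(["beers", "and", "breweries"], ["and"]))
# ['beers', 'breweries']
```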

    Create a Custom truncate_token Token Filter

    A truncate_token token filter uses a specified character length to shorten any input tokens that are too long.

    For example, with a length of four, the token beer and breweries becomes beer, and, and brew.

    To create a new truncate_token token filter with the Couchbase Server Web Console:

    1. In the Type field, select truncate_token.

    2. In the Length field, enter the maximum character length for an output token.

    3. Click Save.
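
Truncation can be pictured as slicing each token to the configured length. In this Python sketch, the function name `truncate_token` is hypothetical, and tokens already within the limit pass through unchanged.

```python
def truncate_token(tokens, length):
    """Cut each token down to at most `length` characters."""
    return [t[:length] for t in tokens]


print(truncate_token(["beer", "and", "breweries"], 4))
# ['beer', 'and', 'brew']
```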