Create a Custom Tokenizer

    Create a custom tokenizer with the Couchbase Capella UI to change how the Search Service creates tokens for matching Search index content to a Search query.

    Prerequisites

    • You have the Search Service enabled on a node in your database. For more information about how to change Services on your database, see Modify a Database.

    • You have logged in to the Couchbase Capella UI.

    • You have created, or have started to create, an index in Advanced Mode.

    Procedure

    You can create 2 types of custom tokenizers:

    • Regular expression: The tokenizer uses any input that matches the regular expression to create new tokens.

    • Exception: The tokenizer removes any input that matches the regular expression, and creates tokens from the remaining input. You can choose another tokenizer to apply to the remaining input.

    Create a Regular Expression Tokenizer

    To create a regular expression tokenizer with the Capella UI:

    1. On the Databases page, select the database that has the Search index you want to edit.

    2. Go to Data Tools > Search.

    3. Click the index where you want to create a custom tokenizer.

    4. Under Advanced Settings, expand Custom Filters.

      Make sure you use Advanced Mode.
    5. Click Add Tokenizer.

    6. In the Name field, enter a name for the custom tokenizer.

    7. In the Type list, select regexp.

    8. In the Regular Expression field, enter the regular expression to match against input. The tokenizer creates a token from each match.

    9. Click Submit.
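To preview which tokens a pattern might produce before you enter it in the UI, the following is a minimal sketch of the behavior described above: every match of the regular expression becomes a token. The function name `regexp_tokenize` and the sample text are hypothetical, and the sketch uses Python's `re` module, whose syntax may differ from the regular expression syntax that the Search Service accepts. It is an illustration only, not the Search Service implementation.

```python
import re


def regexp_tokenize(text: str, pattern: str) -> list[str]:
    """Return each non-overlapping match of `pattern` in `text` as a token.

    Hypothetical helper for illustration; it only approximates the
    "matches become tokens" behavior described for regexp tokenizers.
    """
    return [match.group(0) for match in re.finditer(pattern, text)]


# Example: keep only runs of digits as tokens.
tokens = regexp_tokenize("order-42 shipped to bay 7", r"[0-9]+")
print(tokens)  # ['42', '7']
```

Any input that the pattern does not match, such as the words and the hyphen in the sample text, produces no tokens.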

    Create an Exception Custom Tokenizer

    To create an exception custom tokenizer with the Capella UI in Advanced Mode:

    1. On the Databases page, select the database that has the Search index you want to edit.

    2. Go to Data Tools > Search.

    3. Click the index where you want to create a custom tokenizer.

    4. Under Advanced Settings, expand Custom Filters.

    5. Click Add Tokenizer.

    6. In the Name field, enter a name for the custom tokenizer.

    7. In the Type list, select exception.

    8. In the New Word field, enter a regular expression to use to remove content from input.

    9. To add the regular expression to the list of exception patterns, click Add.

    10. (Optional) To add more regular expressions to the list of exception patterns, repeat the previous 2 steps.

    11. In the Tokenizer for Remaining Input list, select the tokenizer to apply to the input that remains after the tokenizer removes any content that matches the exception patterns.

      For more information about the available tokenizers, see Default Tokenizers.

    12. Click Submit.
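The following is a minimal sketch of the exception tokenizer behavior as this page describes it: content that matches any exception pattern is removed, and a second tokenizer runs on whatever remains. The names `exception_tokenize` and `whitespace_tokenizer` are hypothetical, the whitespace splitter is only a stand-in for the tokenizer you select in the Tokenizer for Remaining Input list, and Python's `re` syntax may differ from what the Search Service accepts. Treat it as a way to reason about your patterns, not as the Search Service implementation.

```python
import re
from typing import Callable


def whitespace_tokenizer(text: str) -> list[str]:
    """Stand-in for the tokenizer applied to the remaining input."""
    return text.split()


def exception_tokenize(
    text: str,
    exception_patterns: list[str],
    remaining_tokenizer: Callable[[str], list[str]],
) -> list[str]:
    """Remove matches of every exception pattern, then tokenize the rest.

    Hypothetical helper for illustration only.
    """
    # Combine the list of exception patterns into one alternation.
    combined = "|".join(f"(?:{pattern})" for pattern in exception_patterns)
    # Replace each match with a space so neighboring words stay separate.
    remaining = re.sub(combined, " ", text)
    return remaining_tokenizer(remaining)


tokens = exception_tokenize(
    "contact admin@example.com about ticket #123",
    [r"\S+@\S+", r"#[0-9]+"],
    whitespace_tokenizer,
)
print(tokens)  # ['contact', 'about', 'ticket']
```

Each entry in `exception_patterns` corresponds to one regular expression added to the list of exception patterns in the procedure above.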