Default Tokenizers

Tokenizers control how the Search Service splits input strings into individual tokens.

You can use a tokenizer when you create a custom analyzer. Choose a default tokenizer or create your own.
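Below is a minimal sketch of how a custom analyzer might reference one of the default tokenizers, written as a Python dictionary that mirrors a JSON index-definition fragment. The analyzer name `my_analyzer`, the `to_lower` token filter, and the surrounding `analysis` structure are illustrative assumptions, not taken from this page; consult your Search index definition for the exact shape.

```python
import json

# A minimal sketch: a custom analyzer that uses the default "unicode"
# tokenizer. The analyzer name "my_analyzer", the "to_lower" token
# filter, and the surrounding "analysis" structure are illustrative
# assumptions about the index-definition JSON, not taken from the docs.
index_params_fragment = {
    "analysis": {
        "analyzers": {
            "my_analyzer": {
                "type": "custom",
                "tokenizer": "unicode",         # any default tokenizer name can go here
                "token_filters": ["to_lower"],  # optional post-tokenization filters
            }
        }
    }
}

print(json.dumps(index_params_fragment, indent=2))
```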

The following default tokenizers are available:

| Tokenizer | Description |
| --- | --- |
| `hebrew` | Separates an input string into tokens that contain only Hebrew alphabet characters. Punctuation marks and numbers are excluded. |
| `letter` | Separates an input string into tokens that contain only Latin alphabet characters. Punctuation marks and numbers are excluded. |
| `single` | Creates a single token from the entire input string. Special characters and whitespace are preserved. |
| `unicode` | Separates an input string into tokens based on Unicode word boundaries (UAX #29). |
| `web` | Creates tokens from the parts of an input string that match email address, URL, Twitter username, and hashtag patterns. |
| `whitespace` | Separates an input string into tokens based on the location of whitespace characters. |
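To make the differences concrete, here is a rough approximation of several of these tokenizers in Python. The regular expressions are assumptions chosen for illustration only; the Search Service's actual rules, especially for `unicode` and `web`, are more involved than what is shown here.

```python
import re

# Illustrative approximations only; the Search Service's actual
# tokenizer rules are more involved than these regular expressions.
text = "שלום world! Visit https://example.com #search"

tokenizers = {
    # letter: runs of Latin alphabet characters; punctuation and digits dropped
    "letter": lambda s: re.findall(r"[A-Za-z]+", s),
    # hebrew: runs of Hebrew alphabet characters only
    "hebrew": lambda s: re.findall(r"[\u05D0-\u05EA]+", s),
    # whitespace: split wherever whitespace occurs; punctuation stays attached
    "whitespace": lambda s: s.split(),
    # single: the whole input string as one token, unchanged
    "single": lambda s: [s],
}

for name, tokenize in tokenizers.items():
    print(f"{name:>10}: {tokenize(text)}")
```

Note how `letter` drops the URL and hashtag entirely, `hebrew` keeps only the Hebrew word, `whitespace` leaves punctuation attached to its neighboring tokens, and `single` preserves the input untouched. For email, URL, and hashtag patterns you would reach for `web`, and for full word-boundary handling, `unicode`.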