Default Tokenizers

Tokenizers control how the Search Service splits input strings into individual tokens.

You can use a tokenizer when you create a custom analyzer. Choose a default tokenizer or create your own.
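Below is a minimal sketch of how a custom analyzer might reference one of the default tokenizers, written as a Python dictionary that mirrors a JSON index-definition fragment. The analyzer name `my_analyzer`, the `to_lower` token filter, and the surrounding `analysis` structure are illustrative assumptions, not taken from this page; consult your Search index definition for the exact shape.

```python
import json

# A minimal sketch: a custom analyzer that uses the default "unicode"
# tokenizer. The analyzer name "my_analyzer", the "to_lower" token
# filter, and the surrounding "analysis" structure are illustrative
# assumptions about the index-definition JSON, not taken from the docs.
index_params_fragment = {
    "analysis": {
        "analyzers": {
            "my_analyzer": {
                "type": "custom",
                "tokenizer": "unicode",         # any default tokenizer name can go here
                "token_filters": ["to_lower"],  # optional post-tokenization filters
            }
        }
    }
}

print(json.dumps(index_params_fragment, indent=2))
```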

The following default tokenizers are available:

| Tokenizer | Description |
| --- | --- |
| `hebrew` | Separates an input string into tokens that contain only Hebrew alphabet characters. Punctuation marks and numbers are excluded. |
| `letter` | Separates an input string into tokens that contain only Latin alphabet characters. Punctuation marks and numbers are excluded. |
| `single` | Creates a single token from the entire input string. Special characters and whitespace are preserved. |
| `unicode` | Separates an input string into tokens based on Unicode word boundaries (UAX #29). |
| `web` | Creates tokens from the parts of an input string that match email address, URL, Twitter username, and hashtag patterns. |
| `whitespace` | Separates an input string into tokens based on the location of whitespace characters. |
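To make the differences concrete, here is a rough approximation of several of these tokenizers in Python. The regular expressions are assumptions chosen for illustration only; the Search Service's actual rules, especially for `unicode` and `web`, are more involved than what is shown here.

```python
import re

# Illustrative approximations only; the Search Service's actual
# tokenizer rules are more involved than these regular expressions.
text = "שלום world! Visit https://example.com #search"

tokenizers = {
    # letter: runs of Latin alphabet characters; punctuation and digits dropped
    "letter": lambda s: re.findall(r"[A-Za-z]+", s),
    # hebrew: runs of Hebrew alphabet characters only
    "hebrew": lambda s: re.findall(r"[\u05D0-\u05EA]+", s),
    # whitespace: split wherever whitespace occurs; punctuation stays attached
    "whitespace": lambda s: s.split(),
    # single: the whole input string as one token, unchanged
    "single": lambda s: [s],
}

for name, tokenize in tokenizers.items():
    print(f"{name:>10}: {tokenize(text)}")
```

Note how `letter` drops the URL and hashtag entirely, `hebrew` keeps only the Hebrew word, `whitespace` leaves punctuation attached to its neighboring tokens, and `single` preserves the input untouched. For email, URL, and hashtag patterns you would reach for `web`, and for full word-boundary handling, `unicode`.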