Analyzers increase search-awareness by transforming input text into token-streams, which permit the management of richer and more finely controlled forms of text-matching. An analyzer consists of modules, each of which performs a particular, sequenced role in the transformation.
Analyzers pre-process input-text submitted for Full Text Search; typically, by removing characters that might prohibit certain match-options. Analysis is performed on document-contents when indexes are created; and is also performed on the input-text submitted for a search. The benefit of analysis is often referred to as language awareness.
For example, if the input-text for a search is enjoyed staying here, and the document-content contains the phrase many enjoyable activities, the dictionary-based words do not permit a match.
However, by using an analyzer that (by means of its inner Token Filter component) stems words, the input-text yields the tokens enjoy, stay, and here; while the document-content yields tokens that include enjoy. By means of the common token enjoy, this permits a match between enjoyed and enjoyable.
Since different analyzers pre-process text in different ways, effective Full Text Search depends on the right choice of analyzer, for the type of matches that are desired.
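The stemming example above can be sketched in a few lines of Python. This is only an illustration: toy_stem and its suffix list are hypothetical, and the stemmer shipped with Couchbase Full Text Search applies many more rules, so its exact tokens may differ.

```python
import re
from typing import List

# Hypothetical suffix list, for illustration only; a production stemmer
# uses a much larger, language-specific rule set.
SUFFIXES = ("able", "ing", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> List[str]:
    """Lower-case the text, split it into runs of letters, and stem each run."""
    words = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(word) for word in words]


query_tokens = analyze("enjoyed staying here")
doc_tokens = analyze("many enjoyable activities")

# The query and the document match if they share at least one token.
common = set(query_tokens) & set(doc_tokens)
print(query_tokens)  # ['enjoy', 'stay', 'here']
print(common)        # {'enjoy'}
```

Because the same analysis is applied to both the query and the document, the two sides meet on the shared token even though the surface words differ.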
Couchbase Full Text Search provides a number of pre-constructed analyzers that can be used with Full Text Indexes. Additionally, analyzers can be custom-created, by means of the Couchbase Web Console. The remainder of this page explains the architecture of analyzers, and describes the modular components that Couchbase Full Text Search makes available for custom-creation. It also lists the pre-constructed analyzers that are available, and describes the modules that they contain.
For examples of both selecting and custom-creating analyzers by means of the Couchbase Web Console, see Creating Indexes.
Analyzers are built from modular components:
Character Filters remove undesirable characters from input: for example, the html character filter removes HTML tags, and indexes HTML text-content alone.
Tokenizers split input-strings into individual tokens, which together are made into a token stream. The nature of the decision-making whereby splits are made differs across tokenizers.
Token Filters are chained together, with each performing additional post-processing on each token in the stream provided by the tokenizer. This may include reducing tokens to the stems of the dictionary-based words from which they were derived, removing any remaining punctuation from tokens, and removing certain tokens deemed unnecessary.
Each component-type is described in more detail below. Note that these components can be used to custom-create an analyzer by means of the Couchbase Web Console. This is explained and exemplified in Creating Indexes.
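The sequencing of the three component-types can be sketched as a simple function composition. The names below (make_analyzer, strip_tags, and so on) are hypothetical and are not part of any Couchbase API; the sketch only shows the order in which the stages run.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def make_analyzer(
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> Callable[[str], List[str]]:
    """Build an analyzer: character filters, then one tokenizer, then token filters."""

    def analyze(text: str) -> List[str]:
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens

    return analyze


def strip_tags(text: str) -> str:
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split wherever whitespace occurs."""
    return text.split()


def to_lower(tokens: List[str]) -> List[str]:
    """Token filter: lower-case every token."""
    return [token.lower() for token in tokens]


def drop_stop_words(tokens: List[str]) -> List[str]:
    """Token filter: remove a small, illustrative set of stop words."""
    stop_words = {"the", "a", "an"}
    return [token for token in tokens if token not in stop_words]


analyzer = make_analyzer([strip_tags], whitespace_tokenizer, [to_lower, drop_stop_words])
tokens = analyzer("<p>The Quick Brown Fox</p>")
print(tokens)  # ['quick', 'brown', 'fox']
```

Order matters in the token-filter chain: here to_lower runs before drop_stop_words so that The is recognized as a stop word.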
Character Filters remove undesirable characters. The following filters are available:
html: Removes HTML elements such as <p>, and decodes entities such as &amp; to their appropriate text equivalent.
zero_width_spaces: Substitutes a regular space-character for each zero-width non-joiner space.
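As a rough illustration of what these two character filters do, the following Python sketch uses only the standard library. The function names are hypothetical, and the regular expression is a simplification of real HTML parsing.

```python
import html
import re

ZERO_WIDTH_NON_JOINER = "\u200c"


def html_char_filter(text: str) -> str:
    """Replace tags with spaces, then decode entities such as &amp;."""
    without_tags = re.sub(r"<[^>]*>", " ", text)
    return html.unescape(without_tags)


def zero_width_spaces_filter(text: str) -> str:
    """Substitute a regular space for each zero-width non-joiner."""
    return text.replace(ZERO_WIDTH_NON_JOINER, " ")


cleaned = html_char_filter("<p>Fish &amp; chips</p>")
print(cleaned.split())  # ['Fish', '&', 'chips']

spaced = zero_width_spaces_filter("a\u200cb")
print(spaced)  # a b
```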
Tokenizers split input-strings into individual tokens: characters likely to prohibit certain kinds of matching (for example, spaces or commas) are omitted. The tokens so created are then made into a token stream for the query.
The following tokenizers are available from the Couchbase Web Console:
Letter: Creates tokens by breaking input-text into subsets that consist of letters only: characters such as punctuation-marks and numbers are omitted. Creation of a token ends whenever a non-letter character is encountered. For example, the text Reqmnt: 7-element phrase would return the following tokens: Reqmnt, element, and phrase.
Single: Creates a single token from the entirety of the input-text. For example, the text in each place would return the following token: in each place. Note that this may be useful for handling URLs or email-addresses, which can thus be prevented from being broken at punctuation or special-character boundaries. It may also be used to prevent multi-word phrases (for example, placenames such as San Francisco) from being broken up due to whitespace; so that they become indexed as a single term.
Web: Creates tokens by identifying and removing html tags. For example, the text <h1>Introduction</h1> would return the token Introduction.
Whitespace: Creates tokens by breaking input-text into subsets according to where whitespace occurs. For example, the text in each place would return the following tokens: in, each, and place.
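The Letter, Single, and Whitespace behaviors described above can be approximated as follows. This is a sketch with hypothetical function names, not the product's implementation; the Web tokenizer is omitted because its rules are not specified in enough detail here.

```python
import re
from typing import List


def letter_tokenizer(text: str) -> List[str]:
    """Return runs of letters; digits, punctuation, and spaces end a token."""
    # [^\W\d_] matches any word character that is neither a digit nor "_".
    return re.findall(r"[^\W\d_]+", text)


def single_tokenizer(text: str) -> List[str]:
    """Return the whole input as one token."""
    return [text]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split the input wherever whitespace occurs."""
    return text.split()


print(letter_tokenizer("Reqmnt: 7-element phrase"))  # ['Reqmnt', 'element', 'phrase']
print(single_tokenizer("in each place"))             # ['in each place']
print(whitespace_tokenizer("in each place"))         # ['in', 'each', 'place']
```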
Token Filters accept a token-stream provided by a tokenizer, and make modifications to the tokens in the stream.
A frequently used form of token filtering is stemming; this reduces words to a base form that typically consists of the initial stem of the word (for example, play, which is the stem of player, playable, and more).
With the stem used as the token, a wider variety of matches can be made (for example, the input-text player can be matched with the document-content playable).
The following kinds of token-filtering are supported by Couchbase Full Text Search:
dict_compound: Allows user-specification of a dictionary whose words can be combined into compound forms, and individually indexed.
apostrophe: Removes all characters after an apostrophe, and the apostrophe itself. For example, they've becomes they.
elision: Identifies and removes characters that prefix a term and are separated from it by an apostrophe. For example, in French, l'avion becomes avion.
edge_ngram: From each token, computes n-grams that are rooted either at the front or the back.
keyword_marker: Identifies keywords and marks them as such. These are then ignored by any downstream stemmer.
normalize_unicode: Converts tokens into Unicode Normalization Form.
ngram: From each token, computes n-grams. There are two parameters, which are the minimum and maximum n-gram length.
shingle: Computes multi-token shingles from the token stream. For example, the token stream the quick brown fox, when configured with a shingle minimum and a shingle maximum length of 2, produces the tokens the quick, quick brown, and brown fox.
stemmer: Uses libstemmer to reduce tokens to word-stems.
stop_tokens: Removes from the stream tokens considered unnecessary for a Full Text Search: for example, and, is, and the.
to_lower: Converts all characters to lower case. For example, HTML becomes html.
truncate: Truncates each token to a maximum-permissible token-length.
possessive: Removes English possessives. For example, John's becomes John.
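Several of the filters listed above are easiest to understand from their output. The sketch below approximates ngram, edge_ngram, shingle, apostrophe, and truncate in Python. The function signatures are hypothetical, minimum lengths are assumed to be at least 1, and the ordering of the emitted tokens is an assumption that may differ from the product's.

```python
from typing import List


def ngram(token: str, min_len: int, max_len: int) -> List[str]:
    """All substrings of token with a length between min_len and max_len."""
    grams = []
    for start in range(len(token)):
        for size in range(min_len, max_len + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def edge_ngram(token: str, min_len: int, max_len: int, from_back: bool = False) -> List[str]:
    """N-grams rooted at the front of the token, or at the back if from_back is set."""
    grams = []
    for size in range(min_len, min(max_len, len(token)) + 1):
        grams.append(token[-size:] if from_back else token[:size])
    return grams


def shingle(tokens: List[str], min_len: int, max_len: int) -> List[str]:
    """Join each run of min_len to max_len adjacent tokens into one token."""
    shingles = []
    for start in range(len(tokens)):
        for size in range(min_len, max_len + 1):
            if start + size <= len(tokens):
                shingles.append(" ".join(tokens[start:start + size]))
    return shingles


def apostrophe(token: str) -> str:
    """Drop the first apostrophe and everything after it."""
    return token.split("'", 1)[0]


def truncate(token: str, max_len: int) -> str:
    """Cut the token down to at most max_len characters."""
    return token[:max_len]


print(ngram("fox", 2, 3))                                   # ['fo', 'fox', 'ox']
print(edge_ngram("quick", 1, 3))                            # ['q', 'qu', 'qui']
print(edge_ngram("quick", 1, 3, from_back=True))            # ['k', 'ck', 'ick']
print(shingle(["the", "quick", "brown", "fox"], 2, 2))      # ['the quick', 'quick brown', 'brown fox']
print(apostrophe("they've"), truncate("activities", 5))     # they activ
```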
Note that token filters are frequently configured according to the special characteristics of individual languages. Couchbase Full Text Search provides multiple language-specific versions of the elision, normalize, stemmer, and stop token filters. Specially supported languages include ca (Catalan), fr (French), ga (Gaelic), it (Italian), ar (Arabic), ckb (Sorani Kurdish), fa (Persian), hi (Hindi), in (Indonesian), en (English), cs (Czech), el (Greek), eu (Basque), hy (Armenian), and pt (Portuguese). Additionally, token filters are provided for normalizing the width of and forming bigrams from tokens based on cjk (Chinese, Japanese, and Korean).
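To illustrate why such filters are configured per language, the sketch below keys an elision filter and a stop-token filter on a language code. The word lists are tiny, hypothetical samples chosen for this example; they are not the lists that Couchbase Full Text Search ships.

```python
from typing import Dict, List, Set

# Hypothetical, deliberately incomplete per-language tables.
ELISION_PREFIXES: Dict[str, Set[str]] = {
    "fr": {"l", "d", "j", "qu", "n", "s", "c", "m", "t"},
}
STOP_WORDS: Dict[str, Set[str]] = {
    "en": {"and", "is", "the"},
    "fr": {"et", "le", "la", "les"},
}


def elision(tokens: List[str], lang: str) -> List[str]:
    """Remove a known prefix that is joined to a term by an apostrophe."""
    prefixes = ELISION_PREFIXES.get(lang, set())
    result = []
    for token in tokens:
        prefix, apostrophe, rest = token.partition("'")
        if apostrophe and rest and prefix.lower() in prefixes:
            result.append(rest)
        else:
            result.append(token)
    return result


def stop_tokens(tokens: List[str], lang: str) -> List[str]:
    """Drop tokens that appear in the stop-word list for the language."""
    stop_words = STOP_WORDS.get(lang, set())
    return [token for token in tokens if token.lower() not in stop_words]


print(elision(["l'avion", "arrive"], "fr"))                    # ['avion', 'arrive']
print(stop_tokens(["le", "chat", "et", "le", "chien"], "fr"))  # ['chat', 'chien']
print(elision(["l'avion"], "en"))                              # ["l'avion"]
```

The last call shows the point of the language key: with no French prefixes configured for en, the token passes through unchanged.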
A number of pre-constructed analyzers are available, and can be selected from the Couchbase Web Console. For examples of selection, see Creating Indexes. The basic analyzers are as follows. See the sections above for details on the referenced analyzer-components.
keyword: Creates a single token representing the entire input, which is otherwise unchanged. This forces exact matches, and preserves characters such as spaces.
simple: Analysis by means of the Unicode tokenizer and the to_lower token filter.
standard: Analysis by means of the Unicode tokenizer, the to_lower token filter, and the stop token filter.
web: Analysis by means of the Web tokenizer and the to_lower token filter.
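The keyword, simple, and standard analyzers, as described above, can be approximated in Python as follows. This is a sketch under stated assumptions: the \w+ pattern is only a rough stand-in for the Unicode tokenizer, and the stop-word set is a small hypothetical sample.

```python
import re
from typing import List

# Hypothetical sample; the real stop token filter uses a longer list.
STOP_WORDS = {"and", "is", "the"}


def unicode_tokenizer_approx(text: str) -> List[str]:
    """Rough stand-in for the Unicode tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def keyword_analyzer(text: str) -> List[str]:
    """One token holding the entire, unchanged input."""
    return [text]


def simple_analyzer(text: str) -> List[str]:
    """Tokenize, then lower-case each token."""
    return [token.lower() for token in unicode_tokenizer_approx(text)]


def standard_analyzer(text: str) -> List[str]:
    """Tokenize, lower-case, then drop stop words."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


print(keyword_analyzer("San Francisco"))     # ['San Francisco']
print(simple_analyzer("The Quick Fox"))      # ['the', 'quick', 'fox']
print(standard_analyzer("The Quick Fox"))    # ['quick', 'fox']
```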
Additionally, a range of analyzers is provided for the specific support of certain languages. Each analyzer is named after the supported language: fr (French), it (Italian), ar (Arabic), ckb (Sorani Kurdish), cjk (Chinese, Japanese, and Korean), fa (Persian), hi (Hindi), and pt (Portuguese).