A newer version of this documentation is available.

View Latest

Understanding Analyzers

      +
      Analyzers increase search-awareness by transforming input text into token-streams, which permit the management of richer and more finely controlled forms of text-matching. An analyzer consists of modules, each of which performs a particular, sequenced role in the transformation.

      Principles of Text-Analysis

      Analyzers pre-process input-text submitted for Full Text Search; typically, by removing characters that might prohibit certain match-options. Analysis is performed on document-contents when indexes are created; and is also performed on the input-text submitted for a search. The benefit of analysis is often referred to as language awareness.

      For example, if the input-text for a search is enjoyed staying here, and the document-content contains the phrase many enjoyable activities, the dictionary-based words do not permit a match. However, by using an analyzer that (by means of its inner Token Filter component) stems words, the input-text yields the tokens enjoy, stay, and here; while the document-content yields the tokens enjoy and activ. By means of the common token enjoy, this permits a match between enjoyed and enjoyable.

      Since different analyzers pre-process text in different ways, effective Full Text Search depends on the right choice of analyzer, for the type of matches that are desired.

      Couchbase Full Text Search provides a number of pre-constructed analyzers that can be used with Full Text Indexes. Additionally, analyzers can be custom-created, by means of the Couchbase Web Console. The remainder of this page explains the architecture of analyzers, and describes the modular components that Couchbase Full Text Search makes available for custom-creation. It also lists the pre-constructed analyzers that are available, and describes the modules that they contain.

      For examples of both selecting and custom-creating analyzers by means of the Couchbase Web Console, see Creating Indexes.

      Analyzer Architecture

      Analyzers are built from modular components:

      • Character Filters remove undesirable characters from input: for example, the html character filter removes HTML tags, and indexes HTML text-content alone.

      • Tokenizers split input-strings into individual tokens, which together are made into a token stream. The nature of the decision-making whereby splits are made differs across tokenizers.

      • Token Filters are chained together, with each performing additional post-processing on each token in the stream provided by the tokenizer. This may include reducing tokens to the stems of the dictionary-based words from which they were derived, removing any remaining punctuation from tokens, and removing certain tokens deemed unnecessary.

      Each component-type is described in more detail below. Note that these components can be used to custom-create an analyzer by means of the Couchbase Web Console. This is explained and exemplified in Creating Indexes.

      Character Filters

      Character Filters remove undesirable characters. The following filters are available:

      • asciifolding: Converts characters that are in the first 127 ASCII characters (basic latin unicode block) into their ASCII equivalents.

      • html: Removes html elements such as <p>.

      • regexp: Uses a regular expression to match characters which should be replaced with the specified replacement string.

      • zero_width_spaces: Substitutes a regular space-character for each zero-width non-joiner space.

      Tokenizers

      Tokenizers split input-strings into individual tokens: characters likely to prohibit certain kinds of matching (for example, spaces or commas) are omitted. The tokens so created are then made into a token stream for the query.

      The following tokenizers are available from the Couchbase Web Console:

      • Letter: Creates tokens by breaking input-text into subsets that consist of letters only: characters such as punctuation-marks and numbers are omitted. Creation of a token ends whenever a non-letter character is encountered. For example, the text Reqmnt: 7-element phrase would return the following tokens: Reqmnt, element, and phrase.

      • Single: Creates a single token from the entirety of the input-text. For example, the text in each place would return the following token: in each place. Note that this may be useful for handling URLs or email-addresses, which can thus be prevented from being broken at punctuation or special-character boundaries. It may also be used to prevent multi-word phrases (for example, placenames such as Milton Keynes or San Francisco) from being broken up due to whitespace; so that they become indexed as a single term.

      • Unicode: Creates tokens by performing Unicode Text Segmentation on word-boundaries, using the segment library. For examples, see Unicode Word Boundaries.

      • Web: Creates tokens by identifying and removing html tags. For example, the text <h1>Introduction<\h1> would return the token Introduction.

      • Whitespace: Creates tokens by breaking input-text into subsets according to where whitespace occurs. For example, the text in each place would return the following tokens: in, each, and place.

      Token Filters

      Token Filters accept a token-stream provided by a tokenizer, and make modifications to the tokens in the stream.

      A frequently used form of token filtering is stemming; this reduces words to a base form that typically consists of the initial stem of the word (for example, play, which is the stem of player, playing, playable, and more). With the stem used as the token, a wider variety of matches can be made (for example, the input-text player can be matched with the document-content playable).

      The following kinds of token-filtering are supported by Couchbase Full Text Search:

      • apostrophe: Removes all characters after an apostrophe, and the apostrophe itself. For example, they’ve becomes they.

      • camelCase: Splits camelCase text to tokens.

      • dict_compound: Allows user-specification of a dictionary whose words can be combined into compound forms, and individually indexed.

      • edge_ngram: From each token, computes n-grams that are rooted either at the front or the back.

      • elision: Identifies and removes characters that prefix a term and are separated from it by an apostrophe. For example, in French, l’avion becomes avion.

      • keyword_marker: Identifies keywords and marks them as such. These are then ignored by any downstream stemmer.

      • length: Removes tokens that are too short or too long for the stream.

      • to_lower: Converts all characters to lower case.

      • ngram: From each token, computes n-grams. There are two parameters, which are the minimum and maximum n-gram length.

      • reverse: Simply reverses each token.

      • shingle: Computes multi-token shingles from the token stream. For example, the token stream the quick brown fox, when configured with a shingle minimum and a shingle maximum length of 2, produces the tokens the quick, quick brown, and brown fox.

      • stemmer_porter: Transforms the token stream as per the porter stemming algorithm.

      • stemmer_snowball: Uses libstemmer to reduce tokens to word-stems.

      • stop_tokens: Removes from the stream tokens considered unnecessary for a Full Text Search. For example, and, is, and the. For example, HTML becomes html.

      • truncate: Truncates each token to a maximum-permissible token-length.

      • normalize_unicode: Converts tokens into Unicode Normalization Form.

      • unique: Only indexes unique tokens during analysis.

      Note that token filters are frequently configured according to the special characteristics of individual languages. Couchbase Full Text Search provides multiple language-specific versions of the elision, normalize, stemmer, possessive, and stop token filters. Specially supported languages are shown in the table immediately below.

      Table 1. Supported Token-Filter Languages
      Name Language

      ar

      Arabic

      bg

      Bulgarian

      ca

      Catalan

      cjk

      Chinese | Japanese | Korean

      ckb

      Kurdish

      da

      Danish

      de

      German

      el

      Greek

      en

      English

      es

      Spanish (Castilian)

      eu

      Basque

      fa

      Persian

      fi

      Finnish

      fr

      French

      ga

      Gaelic

      gl

      Spanish (Galician)

      hi

      Hindi

      hu

      Hungarian

      hy

      Armenian

      id, in

      Indonesian

      it

      Italian

      nl

      Dutch

      no

      Norwegian

      pt

      Portuguese

      ro

      Romanian

      ru

      Russian

      sv

      Swedish

      tr

      Turkish

      Pre-Constructed Analyzers

      A number of pre-constructed analyzers are available, and can be selected from the Couchbase Web Console. For examples of selection, see Creating Indexes. The basic analyzers are as follows. See the sections above for details on the referenced analyzer-components.

      • keyword: Creates a single token representing the entire input, which is otherwise unchanged. This forces exact matches, and preserves characters such as spaces.

      • simple: Analysis by means of the Unicode tokenizer and the to_lower token filter.

      • standard: Analysis by means of the Unicode tokenizer, the to_lower token filter, and the stop token filter.

      • web: Analysis by means of the Web tokenizer and the to_lower token filter.

      Additionally, a range of analyzers is provided for the specific support of certain languages; as shown by the table immediately below.

      Table 2. Supported Analyzer Languages
      Name Language

      ar

      Arabic

      cjk

      Chinese | Japanese | Korean

      ckb

      Kurdish

      da

      Danish

      de

      German

      en

      English

      es

      Spanish (Castilian)

      fa

      Persian

      fi

      Finnish

      fr

      French

      hi

      Hindi

      hu

      Hungarian

      it

      Italian

      nl

      Dutch

      no

      Norwegian

      pt

      Portuguese

      ro

      Romanian

      ru

      Russian

      sv

      Swedish

      tr

      Turkish

      Analysis Wizard

      We host a service where you can test the functionality of our stock analyzers or a custom configured one with the tokenizers, token filters and character filters we offer here (powered by bleve).