Full Text Search using the SDK

    +
    You can use the Full Text Search service (FTS) to create queryable full-text indexes in Couchbase Server.

    Full Text Search or FTS allows you to create, manage, and query full text indexes on JSON documents stored in Couchbase buckets. It uses natural language processing for querying documents, provides relevance scoring on the results of your queries, and has fast indexes for querying a wide range of possible text searches. Some of the supported query types include simple queries like Match and Term queries; range queries like Date Range and Numeric Range; and compound queries for conjunctions, disjunctions, and/or boolean queries. The Python SDK exposes an API for performing FTS queries which abstracts some of the complexity of using the underlying REST API.

    Examples

    Here are all the imports required for the examples below:

    from couchbase.cluster import Cluster, ClusterOptions, PasswordAuthenticator
    from couchbase.exceptions import CouchbaseException
    from couchbase.search import QueryStringQuery, SearchQuery, SearchOptions, PrefixQuery, HighlightStyle, SortField, \
        SortScore, TermFacet
    from couchbase.mutation_state import MutationState
    
    import logging

    Search queries are executed at Cluster level (not bucket or collection). Here is a simple MatchQuery that looks for the text “swanky” using a defined index:

    import traceback
    try:
        result = cluster.search_query("index",
                                      QueryStringQuery("query"))
        for row in result.rows():
            print("Found row {}".format(row)
            print("Reported total rows: "
                  + result.metadata().metrics().total_rows())
    except CouchbaseException:
        logging.error(traceback.format_exc())

    A search query is always performed at the Cluster level, using the search_query method. It takes the name of the index and the type of query as required arguments and then allows the provision of additional options if needed (in the example above, no options are specified).

    Once a result returns you can iterate over the returned rows, and/or access the SearchMetaData associated with the query.

    Couchbase Search’s range of query types enable powerful searching using multiple options, to ensure results are just within the range wanted.

    Open Buckets and Cluster-Level Queries

    If you are using a cluster older than Couchbase Server 6.5, it is required that there is at least one bucket open before performing a cluster-level query. If you fail to do so, the SDK will return a FeatureNotAvailableException with a descriptive error message asking you to open one.

    Search Queries

    The second mandatory argument in the example above used SearchQuery.queryString("query") to specify the query to run against the search index. The query string is the simplest form, but there are many more available. The table below lists all of them with a short description of each. You can combine them with conjuncts and disjuncts respectively. Location objects are specified as a Tuple[SupportsFloat,SupportsFloat] of longitude and latitude respectively.

    Table 1. Available Search Queries
    Name Description

    QueryStringQuery(query: str)

    Accept query strings, which express query-requirements in a special syntax.

    MatchQuery(match: str)

    A match query analyzes input text, and uses the results to query an index.

    MatchPhraseQuery(match_phrase: str)

    The input text is analyzed, and a phrase query is built with the terms resulting from the analysis.

    PrefixQuery(prefix: str)

    A prefix query finds documents containing terms that start with the specified prefix.

    RegexQuery(regexp: str)

    A regexp query finds documents containing terms that match the specified regular expression.

    TermRangeQuery()

    A term range query finds documents containing a term in the specified field within the specified range.

    NumericRangeQuery()

    A numeric range query finds documents containing a numeric value in the specified field within the specified range.

    DateRangeQuery()

    A date range query finds documents containing a date value, in the specified field within the specified range.

    DisjunctionQuery(*queries: SearchQuery)

    A disjunction query contains multiple child queries. Its result documents must satisfy a configurable min number of child queries.

    ConjunctionQuery(*queries: SearchQuery)

    A conjunction query contains multiple child queries. Its result documents must satisfy all of the child queries.

    WildcardQuery(wildcard: str)

    A wildcard query uses a wildcard expression, to search within individual terms for matches.

    DocIdQuery(*ids: str)

    A doc ID query returns the indexed document or documents among the specified set.

    BooleanFieldQuery(value: bool)

    A boolean field query searches fields that contain boolean true or false values.

    TermQuery(term: str)

    Performs an exact match in the index for the provided term.

    PhraseQuery(*terms: str)

    A phrase query searches for terms occurring in the specified position and offsets.

    MatchAllQuery()

    Matches all documents in an index, irrespective of terms.

    MatchNoneQuery()

    Matches no documents in the index.

    GeoBoundingBoxQuery(top_left: Location, bottom_right: Location)

    Searches inside the given bounding box coordinates.

    GeoDistanceQuery(distance: str, location: Location )

    Searches inside the distance from the given location coordinate.

    Search Options

    The cluster.search_query function provides an array of named parameters to customize your query via **kwargs or SearchOptions. The following table lists them all:

    Table 2. Available Search Options
    Name Description

    limit: int

    Allows to limit the number of hits returned.

    skip: int

    Allows to skip the first N hits of the results returned.

    explain: bool

    Adds additional explain debug information to the result.

    scan_consistency: SearchScanConsistency

    Specifies a different consistency level for the result hits.

    consistent_with: MutationState

    Allows to be consistent with previously performed mutations.

    highlight_style: HighlightStyle

    Specifies highlighting rules for matched fields.

    highlight_fields: List[str]

    Specifies fields to highlight.

    sort: List[str]

    Allows to provide custom sorting rules.

    facets: Map[str, SearchFacet]

    Allows to fetch facets in addition to the regular hits.

    fields: List[str]

    Specifies fields to be included.

    raw: JSON

    Escape hatch to add arguments that are not covered by these options.

    Once the search query is executed successfully, the server starts sending back the resultant hits.

    result = cluster.search_query("my-index-name", PrefixQuery("airports-"), SearchOptions(fields=["field-1"]))
    
    for row in result.rows():
        print("Score: {}".format(row.score()))
        print("Document Id: {}".format(row.id()))
    
        # Also print fields that are included in the query
        print(row.fields())

    A conjunction query contains multiple child queries; its result documents must satisfy all of the child queries:

    Limit and Skip

    It is possible to limit the returned results to a maximum amount using the limit option. If you want to skip the first N records it can be done with the skip option.

    result = cluster.search_query(
        "index",
        QueryStringQuery("query"),
        SearchOptions(skip=3, limit=4)
    )

    Highlight

    It is possible to enable highlighting for matched fields. You can either rely on the default highlighting style or provide a specific one. The following snippet uses HTML formatting for two fields:

    result = cluster.search_query(
        "index",
        QueryStringQuery("query"),
        SearchOptions(highlight_style=HighlightStyle.HTML, highlight_fields=["field1", "field2"]))

    Sort

    By default the search engine will sort the results in descending order by score. This behavior can be modified by providing a different sorting order which can also be nested.

    result = cluster.search_query(
        "index",
        QueryStringQuery("query"),
        SearchOptions(sort=[SortScore(), SortField("field")]))

    Facets are aggregate information collected on a result set and are useful when it comes to categorization of result data. The SDK allows you to provide many different facet configurations to the search engine, the following example shows how to create a facet based on a term. Other possible facets include numeric and date ranges.

    Facets

    
    result = cluster.search_query(
        "index",
        QueryStringQuery("query"),
        SearchOptions(facets=dict(categories=TermFacet("category", 5)))
    )

    Fields

    You can tell the search engine to include the full content of a certain number of indexed fields in the response.

    result = cluster.search_query(
        "index",
        QueryStringQuery("query"),
        SearchOptions(fields=["field1", "field2"])
      )

    Working with Results

    The result of a search query has three components: hits, facets, and metdata. Hits are the documents that match the query. Facets allow the aggregation of information collected on a particular result set. Metdata holds additional information not directly related to your query, such as success total hits and how long the query took to execute in the cluster.

    Iterating hits
    for hit in result.hits():
        document_id = hit.id
        score = hit.score
    Iterating facets
    for facet in result.facets():
        name = facet.name
        total = facet.total

    The SearchRow contains the following methods:

    Table 3. SearchRow
    index() The name of the FTS index that gave this result.

    id()

    The id of the matching document.

    score()

    The score of this hit.

    explanation()

    If enabled provides an explanation in JSON form.

    locations()

    The individual locations of the hits as SearchRowLocations.

    fragments()

    The fragments for each field that was requested as highlighted.

    fields()

    Access to the returned fields

    Note that the SearchMetaData also contains potential errors, because the SDK will keep streaming results if the initial response came back successfully. This makes sure that even with partial data usually search results are useable, so if you absolutely need to check if all partitions are present in the result double check the error (and not only catch an exception on the query itself).

    Custom JSON Serializer

    As with all JSON APIs, it is possible to customize the JSON serializer. You can plug in your own library or custom configure mappings on your own JSON serializer. This in turn makes it possible to serialize rows into POJOs or other structures that your application defines, and which the SDK has no idea about.

    Please see the documentation on transcoding and serialization for more information.

    asyncio and Twisted APIs

    In addition to the blocking API on Cluster, the SDK provides Twisted and asyncio APIs on txcouchbase.cluster.TxCluster or acouchbase.cluster.Cluster respectively. If you are in doubt of which API to use, we recommend looking at acouchbase first: it builds on top of asyncio, the de-facto async framework for Python 3.4+.

    The Twisted API on the other hand exposes a Deferred object for use with Twisted-integrated code.