Search Index Architecture
The Search Service works together with the Data Service to process and index your data for Search.
Search indexes rely on synchronization with the Database Change Protocol (DCP) and the Data Service, and on Search index segments, to manage merging and persisting data to disk in your cluster. All changes from DCP and the Data Service are introduced to a Search index in batches, which are then managed as segments.
The Search Service uses the Bleve indexing and search framework.
Search Index Structure
A basic Search index is a list of all of the unique terms that appear in the documents stored in your cluster. For each term, the Search index also contains a list of the documents where that term appears. This list is called an inverted index.
The inverted index can be smaller or larger than your original documents, depending on the complexity of your data.
Specific options in your Search index configuration can also increase its size, such as Store, Include in _all field, and Include Term Vectors. For more information about what options can increase index size and storage requirements, see Child Field Options.
You can also control how these lists of terms, or tokens, are generated by using an analyzer.
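To make the inverted index concrete, here's a minimal sketch in Python. The analyzer and index structures below are toy stand-ins for illustration only, not the Bleve implementation the Search Service actually uses:

```python
# Illustrative sketch: a toy analyzer and inverted index.
import re
from collections import defaultdict

def analyze(text):
    """A minimal analyzer: lowercase the text and split it into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {
    "hotel_1": "Quiet hotel near the beach",
    "hotel_2": "Beach resort with a quiet pool",
}
index = build_inverted_index(docs)
print(sorted(index["quiet"]))   # both documents contain "quiet"
print(sorted(index["resort"]))  # only hotel_2 contains "resort"
```

Swapping in a different analyzer (for example, one that strips stop words or applies stemming) changes which tokens appear in the index, which is why analyzer choice affects both index size and search behavior.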
The inverted index, term dictionary, and any vector index are stored in Search index segments, which are sent to the disk write queue and saved to disk through synchronization with the Data Service.
Synchronization with Database Change Protocol (DCP) and the Data Service
The Search Service uses batches to process data that comes in from DCP and the Data Service. DCP and Data Service changes are introduced gradually, based on available memory on Search Service nodes.
The Search Service can merge batches into a single batch before they’re sent to the disk write queue, to reduce the resources required for batch processing.
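The benefit of merging batches can be sketched as follows. This is a simplified illustration under assumed data shapes (tuples of document ID, sequence number, and body), not the Search Service's actual data structures:

```python
# Hedged sketch: merging incoming batches before the disk write queue,
# keeping only the latest mutation per document.
def merge_batches(batches):
    """Collapse several batches of mutations into one batch, keeping only
    the mutation with the highest sequence number for each document."""
    merged = {}
    for batch in batches:
        for doc_id, seqno, body in batch:
            current = merged.get(doc_id)
            if current is None or seqno > current[0]:
                merged[doc_id] = (seqno, body)
    return merged

batches = [
    [("doc_a", 1, "v1"), ("doc_b", 2, "v1")],
    [("doc_a", 3, "v2")],  # supersedes doc_a at seqno 1
]
merged = merge_batches(batches)
print(merged["doc_a"])  # only the latest mutation survives the merge
```

Because superseded mutations are dropped before they reach the disk write queue, a single merged batch can represent many incoming batches at lower processing cost.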
The Search Service maintains index snapshots on each Search index partition. These snapshots contain a representation of document mutations on either a write queue, or in storage.
Losing Connection with the Data Service
If the Search Service loses its connection to the Data Service, the Search Service sends a reconnection request starting from the last update, or sequence number, that it persisted.
If the index snapshots on the Search Service are too far ahead compared to the Data Service’s sequence numbers, the Search Service recovers sequence numbers from earlier index snapshots. The Search Service then creates stream requests to bring the data in your Search indexes back in sync with the Data Service.
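The rollback decision above can be sketched as choosing the newest persisted snapshot that does not run ahead of the Data Service. This is a simplified illustration with assumed inputs (a plain list of snapshot sequence numbers); the real recovery logic is internal to the Search Service:

```python
# Illustrative sketch of rolling back to an earlier index snapshot.
def pick_resume_seqno(snapshot_seqnos, data_service_seqno):
    """Return the highest persisted snapshot sequence number that does not
    run ahead of the Data Service; stream requests resume from there."""
    usable = [s for s in sorted(snapshot_seqnos) if s <= data_service_seqno]
    return usable[-1] if usable else 0  # 0 means rebuild from the start

# The index persisted snapshots at these sequence numbers...
snapshots = [100, 200, 300]
# ...but after reconnecting, the Data Service is only at seqno 250,
# so the index rolls back to the snapshot at 200.
print(pick_resume_seqno(snapshots, 250))  # → 200
```

Keeping several time-spaced snapshots is what makes this partial rollback possible: without an earlier snapshot at or below the Data Service's sequence number, the index would have to rebuild from scratch.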
Search Index Segments
Search indexes are built with data segments.
All Search indexes contain a root index snapshot, or the collection of segments that hold only the latest version of available data from the Data Service.
The data in newer segments can overwrite the data in older segments. The Search Service maintains time-spaced snapshots to support partial index rollbacks, in case of data sync issues between the Data Service and the Search Service. See Losing Connection with the Data Service.
The stale segments in a snapshot are eventually removed by the Search Service's persister or merger routines, unless these segments are needed to restore an index snapshot.
The persister reads in-memory segments from the disk write queue and flushes them to disk, completing batch operations as part of Synchronization with Database Change Protocol (DCP) and the Data Service. The merger flushes consolidated files to disk and updates the root index snapshot.
The persister and merger interact to continuously flush and merge new in-memory segments to disk, and remove stale segments.
Segments are marked as stale when they’re replaced by a new merged segment created by the merger. Stale segments are deleted when they’re no longer used by any new queries.
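The stale-segment lifecycle described above can be sketched as follows. The reference counting shown here is an assumption for illustration; the Search Service's actual bookkeeping is internal:

```python
# Simplified sketch of the stale-segment lifecycle: a merged segment
# replaces its inputs, which are deleted once no query still uses them.
class Segment:
    def __init__(self, name):
        self.name = name
        self.stale = False
        self.query_refs = 0  # in-flight queries still reading this segment

def merge_segments(old_segments):
    """Create one consolidated segment and mark its inputs as stale."""
    merged = Segment("+".join(s.name for s in old_segments))
    for s in old_segments:
        s.stale = True  # replaced by the new merged segment
    return merged

def collect_stale(segments):
    """Keep only segments that are live or still referenced by a query."""
    return [s for s in segments if not (s.stale and s.query_refs == 0)]

a, b = Segment("seg1"), Segment("seg2")
b.query_refs = 1                 # a running query still reads seg2
merged = merge_segments([a, b])
live = collect_stale([a, b, merged])
print([s.name for s in live])    # seg1 is gone; seg2 survives until its query ends
```

Deferring deletion until no query references a stale segment lets in-flight searches finish against a consistent view of the index while the merger continues consolidating in the background.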
Search Request Processing
When multiple nodes in the cluster run the Search Service, the Search Service uses a scatter-gather process to run Search queries.
The Search Service node that receives the Search request is assigned as the coordinating node. Using gRPC, the coordinating node scatters the request to the partitions of the Search index on the other nodes. The coordinating node then applies filters to the results it receives from those partitions and returns the final result set.
Results are scored and returned in a list, ordered according to the Sort Object provided in the Search request.
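The scatter-gather flow can be sketched as below. In the real service the scatter step travels over gRPC between nodes; here partitions are plain in-process dictionaries, and the scoring is a hypothetical stand-in:

```python
# Hedged sketch of scatter-gather query coordination.
import heapq

def search_partition(partition, query):
    """Stand-in for one index partition: return (score, doc_id) hits."""
    return [(score, doc_id) for doc_id, score in partition.items()
            if query in doc_id]

def coordinate(partitions, query, limit=3):
    """Scatter the query to every partition, then gather and rank results."""
    hits = []
    for partition in partitions:        # scatter (gRPC calls in practice)
        hits.extend(search_partition(partition, query))
    return heapq.nlargest(limit, hits)  # gather: merge by descending score

partitions = [
    {"beach_hotel": 0.9, "city_hotel": 0.4},
    {"beach_resort": 0.7},
]
print(coordinate(partitions, "beach"))
```

Note that each partition returns only its own top candidates, so the coordinating node's final merge and sort is what produces the globally ordered result list.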
For more information about how results are scored and returned for Search requests, see Scoring for Search Queries.