A Couchbase Cluster’s metadata describes its configuration. Some categories of metadata are maintained by means of a consensus protocol; others by means of gossip replication.
Understanding Couchbase-Server Metadata-Management
In Couchbase Server version 7.0 and later, metadata is managed by means of Chronicle; which is a consensus-based system, based on the Raft algorithm. Chronicle manages:
The node-list for the cluster.
The status of each node.
The service-map for the cluster, which indicates on which nodes particular services have been installed.
Bucket definitions, including the placement of scopes and collections.
The vBucket maps for the cluster.
The process whereby this metadata is maintained is described below, in Consensus-Based Metadata-Management.
Additional cluster-metadata is handled by Couchbase Server’s legacy metadata-management system, which is based on gossip-replication. This legacy-managed metadata includes:
Settings and metadata for all services.
Per-node configuration settings.
Changes to the legacy-managed metadata are asynchronously replicated across the nodes with eventual consistency, by means of gossip replication. This means that when a metadata change occurs on any node, that node attempts to replicate the change to all other nodes. Additionally, each node periodically pulls the configuration of some other, randomly selected node; in order to update its own configuration in cases where it has somehow missed an earlier change-notification. By these means, configuration generally spreads rapidly and reliably through the cluster.
When a cluster-wide activity such as rebalance or failover needs to occur, this is performed by the cluster’s orchestrator (its master node). To do so, the orchestrator obtains a lease on metadata changes; to ensure that no topology-related metadata changes can be made by any other node, while the cluster-wide activity is in process.
Starting with version 7.0, Couchbase Server provides consensus-based metadata management, by means of Chronicle; which is a methodology based on the Raft algorithm. Chronicle is:
Supportive of full linearizability.
Fully tested with Jepsen.
Couchbase-Server administrators are not required to be familiar with the internal workings of Chronicle. However, for informational purposes, a sketch of how Chronicle works is provided below. See the Raft specification for a complete account of the algorithm on which Chronicle is based.
Logs, Leaders, and Followers
Each node maintains a log, into which metadata change-commands are saved. Across the cluster, logs are maintained with consistency. At intervals, to control log-size and ensure availability, on each node, compaction is performed, and snapshots are taken.
Each node is considered to be in the role of either leader or follower; with at most one node the leader at any time. Only the leader can transmit metadata change-commands. The role of leader may periodically be exchanged between nodes, by the process described below.
Leadership and Election
The leader is responsible for communicating with clients: this includes providing the current topology of the cluster, and receiving requests for topology change. The leader distributes clients' requests to the followers, as change-commands, which are to be appended to the log-instance on each node.
In the event of the leader becoming non-communicative (say, due to network failure), a follower can advertise itself as a candidate for leadership. Followers, receiving the communication, respond with votes. Every vote received by the candidate, including its own, constitutes one node’s support for the candidacy. If a majority of nodes are supportive, a new term is started, with the elected node as leader.
Nodes are given the right to vote only when fully integrated into the cluster. Nodes not fully integrated (as is the case, for example, during the process of their addition) are considered replicas. Replicas participate in the exchange and commitment of information (see immediately below), but do not vote. Once a node’s addition is complete, the node is able to vote.
Data Exchange and Commitment
When the leader transmits a change-command to followers, each follower appends the command to its own instance of the log. Once a majority of nodes confirm that they have done so, the leader performs a commit; and informs the other nodes. Once informed, the other nodes also commit.
Committing means updating the Replicated State Machine, according to the command. This is described below.
Following commitment, the leader returns the execution-result to the client.
Network Failures and Consequent Inconsistencies
When network failures prevent change-commands from reaching a majority of nodes, the information is appended to the logs of those nodes that have received it; but is not committed. Whenever a leader commits a new log entry, it also commits all preceding uncommitted entries in its log; and each commitment is applied, across the cluster, to all instances of the Replicated State Machine.
Replicated State Machine
The Replicated State Machine is a key-value store, resident on each node, that contains the cluster’s topographical data. Clients that require such data receive it from the leader’s key-value store. The key-value pairs in the store are generated and updated based on change-commands appended to the replicated log. A change-command may:
Add a key-value pair.
Update the value of a key-value pair.
Update the value of a key-value pair, based on logical constraints.
Update the values of key-value pairs transactionally.
Each key-value pair has a revision number that corresponds to the sequence of change-commands applied to the store.
A quorum failure occurs when half or more of the nodes in the cluster cannot be contacted. In this situation, commitment is prohibited; until either the communication problem is remedied, or the non-communicative nodes are failed over unsafely. In consequence, prior to remedy or failover, for the duration of the quorum failure:
Buckets, scopes, and collections can neither be created nor dropped.
Nodes cannot be added, joined, failed over safely, or removed.
See Performing an Unsafe Failover, for information on failing over nodes in response to a quorum failure.