Hard Failover

Hard failover takes an unresponsive node out of the cluster.

Understanding Hard Failover

Hard failover takes an unresponsive node out of the cluster. Hard failover can be performed on any node in the cluster: slightly different actions are taken for each affected service, as described below.

Hard failover should only be used when a node has become unresponsive: it should not be used as part of regular, planned maintenance activities that can be conducted with all cluster-nodes available. For information on taking nodes out of the cluster in such circumstances, see Removal and Graceful Failover.

Hard failover can be initiated by means of:

  • Couchbase Web Console, manually.

  • The CLI or REST API; either manually, or from a script/program/ops platform.

  • Automatic Failover, triggered by the Cluster Manager: see Automatic Failover.

For further information on initiating hard failover, see Perform Hard Failover.
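
For illustration, a hard failover might be initiated from the command line as in the following sketch. The cluster address, credentials, and node address are placeholders, and the exact options can vary by Server version; see the CLI and REST API references for details.

  # Hard failover of an unresponsive node with couchbase-cli; in recent
  # releases the --hard flag selects hard (rather than graceful) failover.
  couchbase-cli failover \
    --cluster http://cluster-host:8091 \
    --username Administrator --password password \
    --server-failover failed-node:8091 \
    --hard

  # Equivalent REST call: POST to /controller/failOver, identifying the
  # node by its otpNode name (typically ns_1@<node-ip>).
  curl -u Administrator:password \
    -X POST http://cluster-host:8091/controller/failOver \
    -d 'otpNode=ns_1@failed-node'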

Effect of Hard Failover on Services

Hard failover has a different effect on each Couchbase Service.

Data Service

Since active vBuckets have become unavailable, the Cluster Manager promotes replica vBuckets to active status on the remaining cluster-nodes. A revised cluster-map is then communicated to all clients.

When the process is complete, no further attempts will be made to access vBuckets on the failed-over node, but the node has not yet been fully taken out of the cluster. The cluster is at this point considered to be in a degraded state, in terms of availability; as it now contains a reduced number of replica vBuckets, and may lack resources for handling a further node-outage. Rebalance should be performed.
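
As a sketch only, using placeholder addresses and credentials, the node's failed-over status can be confirmed and the rebalance started from the command line (the Web Console can equally be used):

  # List cluster nodes and their statuses; the failed-over node should no
  # longer be reported as an active member of the cluster.
  couchbase-cli server-list -c http://cluster-host:8091 \
    -u Administrator -p password

  # Rebalance so that replicas are recreated on the remaining nodes,
  # restoring the configured level of availability.
  couchbase-cli rebalance -c http://cluster-host:8091 \
    -u Administrator -p password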

Index Service

When Global Secondary Indexes (GSIs) are defined, each is by default created on a single node that runs the Index Service. If that node fails, its indexes become unavailable. If the node is subsequently repaired and added back into the cluster by means of Delta Recovery, the indexes are updated and become available again. If the node is replaced, the indexes must be recreated: in this case, before the new node is rebalanced into the cluster, the failed node must be taken out of the cluster by means of hard failover.

Query Service

Query Service nodes are stateless, and can be added to and removed from the cluster with no consequence to data. However, ongoing queries on those nodes are interrupted by hard failover, and therefore produce errors.

As long as one Query Service node continues to be available, the cluster continues to support querying, although potentially with reduced performance.

Search Service

If multiple nodes run the Search Service, Full Text Indexes are partitioned and automatically distributed across those nodes. If a Search Service node is failed over, it stops taking traffic. If no other nodes run the Search Service, all building of Full Text Indexes stops, and searches fail. If at least one other node runs the Search Service, that node continues to handle queries and return partial results.

When rebalance occurs:

  • If replica Full Text Indexes have been configured, the Search Service promotes the replicas to active on the remaining nodes of the cluster.

  • If replica Full Text Indexes have not been configured, the Search Service rebuilds the indexes on the remaining nodes of the cluster, using stored index definitions.

Note that the Search Service is not supported by Delta Recovery.

Eventing Service

If a node hosting the Eventing Service undergoes hard failover, the service on that node stops, and ongoing mutations on the node are interrupted: this potentially results in data loss. If the node is restored to the cluster, and the Eventing Service is restarted, mutations resume: however, this may result in the processing of mutations that are duplicates of mutations made prior to failover, leading to inappropriate changes to data.

If multiple cluster-nodes host the Eventing Service, the Eventing Service on each of the remaining nodes continues to function in the absence of a failed-over Eventing-Service node. Note that under these conditions, vBucket reallocations that occur due to failover may lead to mutations being interrupted, and therefore potentially to data loss.

Analytics Service

The Analytics Service uses shadow data, which is a single copy of a subset of the data maintained by the Data Service. The shadow data is not replicated; however, its single copy is partitioned across all cluster nodes that run the Analytics Service.

If any single Analytics-Service node undergoes hard failover, the Analytics Service and all analytics processing stop, cluster-wide. If the lost Analytics-Service node is restored to the cluster, and the service is restarted, no rebuilding of shadow data is necessary, and analytics processing resumes across the Analytics-Service nodes of the cluster. However, if a lost Analytics-Service node is permanently removed or replaced, all shadow data must be rebuilt, if and when the Analytics Service is restarted.

Returning the Cluster to a Stable State

If or when the failed node is repaired and ready, it can be added back to the cluster via Delta or Full Recovery. Alternatively, an entirely new node can be added instead.

  • Delta Recovery can be performed when the Cluster Manager recognizes the node as a previous member of the cluster. If Delta Recovery fails, Full Recovery must be performed.

    When a node is added back to the cluster using Delta Recovery, the replica vBuckets on the failed-over node are considered to be trusted, but behind on data. The Cluster Manager therefore resynchronizes the vBuckets, so that their data becomes current. When this operation is complete, vBuckets are promoted to active status as appropriate, and the cluster map is updated.

  • If the node is added back using Full Recovery, the node is treated as an entirely new node: it is reloaded with data, and requires rebalance.

  • If the node cannot be added back, a new node can be added, and the cluster rebalanced.

Prior to rebalance, a cluster should always be restored to an appropriate size and topology. Note that a rebalance performed prior to the re-adding of a failed-over node prevents Delta Recovery.
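
As a minimal sketch, assuming placeholder addresses and credentials, a repaired node can be marked for Delta Recovery (or Full Recovery) and the cluster then rebalanced from the command line:

  # Mark the previously failed-over node for Delta Recovery; use
  # --recovery-type full to request Full Recovery instead.
  couchbase-cli recovery -c http://cluster-host:8091 \
    -u Administrator -p password \
    --server-recovery failed-node:8091 \
    --recovery-type delta

  # Rebalance to bring the recovered node back into service.
  couchbase-cli rebalance -c http://cluster-host:8091 \
    -u Administrator -p password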

Hard Failover Example

Given:

  • A cluster containing four nodes, each of which runs the Data Service

  • A single replica configured per bucket, such that 256 active and 256 replica vBuckets reside on each node

  • Node 4 of the cluster, on which vBucket #762 resides, offline and apparently unrecoverable

The following occur:

  1. Clients attempting reads and writes on node 4 receive errors or timeouts.

  2. Hard failover is initiated, either manually or automatically, to remove node 4.

  3. The Cluster Manager promotes the replica vBucket 762 to active status, on node 2. The cluster now has no replica for vBucket 762.

  4. The cluster map is updated, so that clients' subsequent reads and writes go to the correct location for vBucket 762, which is now node 2.

The same process is repeated for the remaining 255 active vBuckets that resided on node 4. If the cluster contains multiple buckets, the process is repeated for each bucket in turn, one bucket at a time.
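
As an illustrative check, assuming a bucket named travel-sample, placeholder credentials, and the availability of jq, the new location of vBucket 762 can be read from the bucket's vBucket map, which is returned as part of the bucket's REST information:

  # vBucketMap[762] holds indexes into serverList: the first entry is the
  # node with the active copy of vBucket 762 (node 2 after this failover);
  # a value of -1 means no copy currently exists at that position.
  curl -s -u Administrator:password \
    http://cluster-host:8091/pools/default/buckets/travel-sample |
    jq '.vBucketServerMap | {serverList, vbucket762: .vBucketMap[762]}'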

Hard Failover and Multiple Nodes

Unless Server Group Awareness is in operation, multiple nodes should not be failed over simultaneously unless enough replica vBuckets exist on the remaining nodes to support the required promotions to active status, and unless the number and capacity of the remaining nodes allow continued cluster operation. For example, if two nodes are to be failed over, two replicas must be configured per bucket, to prevent data loss.

Unrecognized Non-Availability

In rare cases, the Cluster Manager fails to recognize the unavailability of a node. In such cases, if graceful failover does not succeed, hard failover should be performed.