Automatic Failover

      One or more nodes can be failed over automatically when they become unresponsive or experience a configured auto-failover triggering event. An auto-failover is performed only if all safety checks pass; these checks are performed to maintain data safety, ensuring that no data loss occurs as a result of the failover.

      Understanding Automatic Failover

      Important: The concept of auto-failover, and failover in general, applies to nodes in a single cluster, including nodes in a Server Group. There are no separate settings for Server Group failover or Server Group auto-failover. If auto-failover settings are configured properly, and if the cluster services topology allows, auto-failover of all the nodes in a Server Group can occur. For example, if an event makes all of the nodes in a Server Group unavailable, and therefore eligible for auto-failover, the entire group is automatically failed over only if all of the constraints listed in Auto-failover Constraints are met, including the majority quorum requirement: the remaining nodes must be able to form a majority quorum in order to initiate an auto-failover.

      See Server Group Awareness for more information about Server Groups.

      Note that node failover in Couchbase Server occurs within the context of a single cluster, and auto-failover only occurs within a single cluster. In the context of Cross Data Center Replication (XDCR), failover refers to application failover to a different cluster; application failovers are always determined and controlled by the user.

      Automatic Failover — or auto-failover — can be configured to fail over one or more nodes automatically. No immediate administrator intervention is required. Specifically, the Cluster Manager autonomously detects and verifies that the nodes are unresponsive, and then initiates the hard failover process. Auto-failover does not fix or identify problems that may have occurred. Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required. Auto-failover is always a hard failover. For information on how services are affected by hard failover, see Hard Failover.

      As a reminder, failover is a mechanism in Couchbase Server that allows a node to be taken out of the cluster, so that applications no longer reference the services on the failed node and availability is maintained. Failover is at the node level. The automatic failover process for a non-responsive or unhealthy node starts when the cluster manager detects, per the configured auto-failover settings, that the node is unresponsive (its cluster manager is not sending heartbeats to the cluster managers of the other nodes), or that the Data or Index Service on the node is unhealthy (the service heartbeat or process is not responding to the cluster manager). Multiple safety checks are then run to determine whether an auto-failover can be performed. If all checks pass, the cluster manager performs the hard failover process.

      The Failover Events section describes the events that start the auto-failover process. The Auto-Failover Constraints section explains the safety checks performed after an auto-failover triggering event; these checks determine whether an auto-failover should actually be performed. The checks required for service-specific safety policies are described in the Service-Specific Auto-Failover Policy section.

      For information on managing auto-failover, see the information provided for Couchbase Web Console at General (which provides information on general cluster-settings), for the REST API at Managing Auto-Failover, and for the CLI at setting-autofailover.

      Failover Events

      Auto-failover occurs in response to failure events. Auto-failover applies to the node: it is the node that fails over, regardless of the triggering event. Specific types of events trigger auto-failover processing; however, auto-failover actually occurs only if all of the checks (constraints and policies) for auto-failover pass.

      The constraints and policies checked after the triggering event are described in the Auto-Failover Constraints and Service-Specific Auto-Failover Policy sections. For example, despite the triggering event, auto-failover of a node may be prevented due to the lack of a quorum, or because no other nodes are running a service that exists on the node to be failed over.

      Below are the types of events that can trigger auto-failover processing on a node:

      • Node failure. A server-node within the cluster is unresponsive (due to a network failure, very high CPU utilization, an out-of-memory problem, or another node-specific issue). This means that the cluster manager of the node has not sent heartbeats in the configured timeout period, and therefore, the health of the services running on the node is unknown.

        • Multiple node failure. Concurrent, correlated failure of multiple nodes; for example, a physical rack of machines, or multiple virtual machines sharing a host.

      • Data Service disk read/write issues. Attempts by the Data Service to read from or write to disk on a particular node have resulted in a significant rate of failure (errors returned), for longer than a specified time-period.

      • The Index or Data Service running on the node is non-responsive or unhealthy.

        • Index Service non-responsiveness. The Index Service running on a node sends heartbeat messages to the cluster manager as an indication of its health. If the Index Service fails to send a heartbeat, it is considered unhealthy; and if it stays unhealthy for the user-specified threshold time for auto-failover, the cluster manager starts the auto-failover checks for the node on which the Index Service is running.

        • Data Service is unhealthy. In addition to the configured monitoring for Data Service disk read/write issues, the Data Service running on a node can be deemed unhealthy by various other internal monitoring. If the Data Service stays unhealthy for the user-specified threshold time for auto-failover, the cluster manager starts the auto-failover checks for the node on which the Data Service is running.

      Note that Data Service and Index Service health monitoring for auto-failover uses the same Timeout value that is set for node unresponsiveness (see Configuring Auto-Failover); this is the user-specified threshold time for auto-failover mentioned in the Data Service and Index Service monitoring above.

      Note that on a node running only the Search, Eventing, Query, Analytics, or Backup Service, the services could become unhealthy; but as long as cluster manager heartbeats are sent and processed by the rest of the cluster, an auto-failover of the node will not be attempted. This is because only the health of the Data and Index Services is monitored for node auto-failover.

      Auto-Failover Constraints

      If a monitored or configured auto-failover event occurs, an auto-failover is performed only if all of the safety checks pass. These checks are explained in this section and in the Service-Specific Auto-Failover Policy section.

      The quorum constraint is a critical part of auto-failover, since the cluster must be able to form a quorum to initiate a failover, following the failure of some of its nodes. For Server Groups, this means that if you have two server groups with an equal number of nodes, then for auto-failover of all the nodes in one server group to be possible, you could deploy an arbiter node (or another node) in a third physical server group, which allows the remaining nodes to form a quorum.
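      The majority-quorum arithmetic is simple: a majority is more than half of the nodes in the cluster. The following minimal Python sketch (with illustrative helper names that are not part of any Couchbase API) shows why two equal server groups cannot survive the loss of a whole group without an arbiter in a third group.

        # Minimal sketch of the majority-quorum arithmetic; helper names are
        # illustrative only and are not part of any Couchbase API.

        def majority_quorum(cluster_size: int) -> int:
            """Smallest number of nodes that constitutes a majority."""
            return cluster_size // 2 + 1

        def can_initiate_auto_failover(cluster_size: int, failed_nodes: int) -> bool:
            """True if the surviving nodes can still form a majority quorum."""
            return (cluster_size - failed_nodes) >= majority_quorum(cluster_size)

        # Two equal server groups of 4 nodes each: losing one whole group leaves
        # 4 of 8 nodes, below the quorum of 5, so auto-failover cannot proceed.
        print(can_initiate_auto_failover(8, 4))   # False
        # With an arbiter node in a third group: 5 of 9 nodes remain; quorum is 5.
        print(can_initiate_auto_failover(9, 4))   # True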

      Another critical auto-failover constraint for Server Groups is the maximum number of nodes that can be automatically failed over (maxCount in /settings/autoFailover) before administrator intervention is required. If you want an entire server group of nodes to be able to be automatically failed over, then the maxCount value should be at least the number of nodes in the server group. You can check the current maxCount setting with GET /settings/autoFailover. The value of count in the same GET /settings/autoFailover output tells you how many node auto-failovers have occurred since the parameter was last reset. Running a rebalance resets the count value back to 0. The count should not be reset manually unless guided by Support, since resetting it manually will cause you to lose track of the number of auto-failovers that have already occurred without the cluster being rebalanced.
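      As an illustration only, the following Python sketch reads the settings with GET /settings/autoFailover; the host, port, and credentials are assumptions, and the response is expected to contain the maxCount and count values described above.

        # A minimal sketch, assuming a cluster node reachable at localhost:8091
        # and Full Admin credentials; adjust these for your own deployment.
        import requests

        resp = requests.get(
            "http://localhost:8091/settings/autoFailover",
            auth=("Administrator", "password"),
        )
        resp.raise_for_status()
        settings = resp.json()

        # maxCount: auto-failovers allowed before administrator intervention.
        # count: auto-failovers that have occurred since the last reset
        # (a successful rebalance resets it to 0).
        remaining = settings["maxCount"] - settings["count"]
        print(f"Auto-failovers remaining before intervention: {remaining}")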

      The list below describes other conditions that must be met for an auto-failover to be executed even after a monitored or configured auto-failover event has occurred.

      • Only if the majority of nodes in the cluster can form a quorum to initiate the failover, following the failure of one or more nodes. For example, given a cluster of 18 nodes, 10 nodes are required for the quorum; and given a cluster of 17 nodes, 9 nodes are required for the quorum.

        • In the event of certain disk read/write issues, it is important to also consider the efficacy of the nodes in the cluster. Should a node encounter such an issue, particularly one causing extreme disk slowness, and the disk encountering the issue is shared between the Data Service and the Cluster Manager, then this node is unlikely to be able to participate in the quorum. The idea of a quorum should therefore be extended beyond “nodes that can communicate with one another” to “nodes that can communicate with one another and can reliably read from, and write to, their respective disks”.

          Note that in the event of some disk read/write issues, the Cluster Manager may instead become completely unresponsive to the other nodes in the cluster. In such a scenario, an auto-failover should be possible, provided that all other constraints are met.

        • In the event of certain disk read/write issues on the orchestrator node, particularly extreme disk slowness, it may not be possible to perform an auto-failover at all. The orchestrator node is in charge of orchestrating failover and, as such, must be part of the quorum of nodes required to perform the failover.

      • Only up to an administrator-specified maximum number of nodes. After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator, or until a rebalance is successfully performed. Note, however, that the count can be manually reset, or a rebalance performed, prior to the maximum number being reached.

      • In no circumstances where data loss might result: for example, when a bucket has no replicas. Consequently, even a single event may not trigger a response, and the administrator-specified maximum number of failed nodes may never be reached.

      • Only in accordance with the Service-Specific Auto-Failover Policy for the service or services on the unresponsive node.

      • If the cluster has been configured to preserve durable writes, only if the failover cannot result in the loss of durably written data. See Preserving Durable Writes.

      Now that you know the constraints that must be met for auto-failover to occur (after a monitored/configured event is seen), below are some other important considerations when configuring auto-failover:

      • Auto-failover should be configured only when the cluster contains sufficient resources to handle all possible consequences: workload-intensity on the remaining nodes may increase significantly. The maxCount parameter of the auto-failover settings can be used to avoid this sort of thundering-herd problem, by limiting the number of nodes that can be automatically failed over. However, remember that if you want all of the nodes in a server group to be able to be automatically failed over, maxCount should be at least the number of nodes in the server group.

      • Auto-failover may take significantly longer if the unresponsive node is the one on which the orchestrator is running, since time-outs must occur before the available nodes can elect a new orchestrator node and thereby continue. Faster failover can be achieved by deploying an arbiter node, which is a node that hosts no Couchbase service. See Fast Failover.

      • Auto-failover is for intra-cluster use only: it does not work with Cross Data Center Replication (XDCR).

      See Alerts, for details on configuring email alerts related to failover.

      See Server Group Awareness, for information on server groups.

      See Adding Arbiter Nodes for more information on the use of arbiter nodes for fast failover and quorum arbitration.

      See Cluster Manager for more information on the NS-Server and the Master Services, aka the Orchestrator.

      Service-Specific Auto-Failover Policy

      When a monitored or configured auto-failover event occurs on a node, constraints must be checked to determine whether the node can be automatically failed over. An example of such an event is the node's cluster manager becoming unresponsive. One of these constraints is the set of policies or rules specific to the services running on the unresponsive node. Since a number of different service configurations are possible, the auto-failover policy for Couchbase Services is described below, followed by specific examples.

      The auto-failover policy for Couchbase Services is as follows:

      • A service must be running on a minimum number of nodes, for auto-failover to be applied to any one of those nodes, should that node become unresponsive.

      • The required minimum number of nodes is service-specific.

      • If the Data Service is running on its required minimum number of nodes, auto-failover may be applied to any of those nodes, even when auto-failover policy is thereby violated for one or more other, co-hosted services. This is referred to as Data Service Preference.

      • When the Index Service is co-located with the Data Service, it is not consulted on failing over the node.

      The node-minimum for each service is provided in the following table:

      Service     Nodes Required
      Data        2
      Query       2
      Index       2
      Search      2
      Analytics   2
      Eventing    2

      The figures in the Nodes Required column indicate the minimum number of nodes on which the corresponding service must be running for auto-failover to be triggered, unless Data Service Preference applies.

      This is illustrated by the following examples:

      • A cluster has the following five nodes:

        Node   Services Hosted
        #1     Data
        #2     Data
        #3     Search & Query
        #4     Search & Query
        #5     Analytics & Query

        If node #4 becomes unresponsive, auto-failover can be triggered, since prior to unavailability, the Search Service was on two nodes (#3 and #4), and the Query Service on three (#3, #4, and #5): both of which figures meet the auto-failover policy requirement (for each of those services, 2).

        However, if instead, node #5 becomes unresponsive, auto-failover is not triggered; since prior to unavailability, the Analytics Service was running only on one node (#5), which is below the auto-failover policy requirement for the Analytics Service (which is 2).

      • A cluster has the following three nodes:

        Node   Services Hosted
        #1     Data, Query, & Search
        #2     Data
        #3     Arbiter Node (no services)

        If node #1 becomes unresponsive, auto-failover can be triggered. This is due to Data Service Preference, which applies auto-failover based on the policy for the Data Service, irrespective of other services on the unresponsive node. In this case, even though the Query and Search Services were both running on only a single node (#1), which is below the auto-failover policy requirement for each of those services (2), the Data Service was running on two nodes (#1 and #2), which meets the auto-failover policy requirement for the Data Service (2).

      • A cluster has the following four nodes:

        Node   Services Hosted
        #1     Data & Query
        #2     Data, Index, & Query
        #3     Data & Search
        #4     Index

        If node #1, #2, or #3 becomes unresponsive, auto-failover can be triggered. In each case, this is due to Data Service Preference, which applies auto-failover based on the policy for the Data Service, irrespective of other services on the unresponsive node.

        Note that in the case of node #2, this allows an Index Service node to be automatically failed over.

        If node #4 becomes unresponsive, auto-failover of the node can be triggered, since the Index Service supports auto-failover independently of the Data Service. Note that in this scenario, Index Service auto-failover occurs only if the action does not result in the loss of any indexes or index partitions.

        If an index does not have a replica and is co-located on a Data Service node that is failed over, then the index will be lost.
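
      The decision logic illustrated by these examples can be summarized in a simplified Python sketch. This is an illustration only, not Couchbase Server's actual implementation; it covers just the service node-minimums and Data Service Preference, and omits the other constraints (quorum, maximum count, durability) described earlier.

        # Simplified illustration of the service-specific policy plus Data
        # Service Preference; not Couchbase Server's actual implementation.

        MIN_NODES = {"data": 2, "query": 2, "index": 2,
                     "search": 2, "analytics": 2, "eventing": 2}

        def policy_allows_failover(cluster: dict, failed_node: str) -> bool:
            """cluster maps node name -> set of services hosted on that node."""
            services_on_failed = cluster[failed_node]

            def node_count(service: str) -> int:
                return sum(1 for svcs in cluster.values() if service in svcs)

            # Data Service Preference: if the Data Service meets its minimum,
            # the node may be failed over regardless of its other services.
            if "data" in services_on_failed and node_count("data") >= MIN_NODES["data"]:
                return True

            # Otherwise, every service on the node must meet its own minimum.
            return all(node_count(s) >= MIN_NODES[s] for s in services_on_failed)

        # Second example above: node #1 hosts Data, Query, and Search.
        cluster = {"n1": {"data", "query", "search"}, "n2": {"data"}, "n3": set()}
        print(policy_allows_failover(cluster, "n1"))   # True, via Data Service Preference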

      Configuring Auto-Failover

      Auto-failover is configured by means of parameters that include the following.

      • Timeout. The number of seconds that must elapse, after a node or group has become unresponsive, before auto-failover is triggered. This number is configurable. The default is 120 seconds; the minimum permitted duration is 1 second when set through the REST API and 5 seconds when set from the UI. The maximum is 3600 seconds. Note that a low number reduces the potential time-period during which a consistently unresponsive node remains unresponsive before auto-failover is triggered; but may also result in auto-failover being unnecessarily triggered, in consequence of short, intermittent periods of node unavailability.

      Note that the monitoring of Data Service and Index Service health for auto-failover uses the same Timeout value set for node unresponsiveness. For example, if the Index Service is deemed unhealthy (because of its failure to send heartbeats) for the Timeout amount of time, then the node on which the Index Service is running is considered for auto-failover, even though the node's cluster manager may still be responding and sending heartbeats.

      • Maximum count. The maximum number of nodes that can fail (either concurrently or sequentially in one or more events) and be handled by auto-failover. The maximum value can be 100 and the default is 1. This parameter is available in Enterprise Edition only: in Community Edition, the maximum number of nodes that can fail and be handled by auto-failover is always 1.

      • Count. The number of nodes that have already been automatically failed over. The default value is 0. The value is incremented by 1 for every node that is automatically failed over, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0. Running a rebalance resets the count value back to 0.

      • Enablement of disk-related automatic failover, with corresponding time-period. Whether automatic failover is enabled to handle continuous read-write failures. If it is enabled, a number of seconds can also be specified: this is the length of a constantly recurring time-period against which failure-continuity on a particular node is evaluated. The default is 120 seconds, the minimum permitted is 1 second, and the maximum is 3600 seconds. Automatic failover is triggered if at least 60% of the most recently elapsed instance of the time-period consists of continuous disk-related failure, in addition to the standard auto-failover timeout. For example, with the default 120-second time-period, disk-related failures must have been continuous for at least 72 seconds (60% of 120) of that period.

      By default, auto-failover is switched on, to occur after 120 seconds for up to 1 event. Nevertheless, Couchbase Server triggers auto-failover only within the constraints described above, in Auto-Failover Constraints.

      For practical steps towards auto-failover configuration, see the documentation provided for specifying General settings with Couchbase Web Console UI, for Managing Auto-Failover with the REST API, and setting-autofailover with the CLI.
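
      As an illustration of applying these settings programmatically, the following Python sketch posts to the /settings/autoFailover endpoint referred to above. The host, port, credentials, and the disk-related parameter names shown are assumptions; check them against Managing Auto-Failover in the REST API reference before use.

        # A minimal sketch, assuming a cluster node at localhost:8091 and Full
        # Admin credentials; parameter names other than maxCount are assumptions
        # to be verified against the Managing Auto-Failover REST reference.
        import requests

        resp = requests.post(
            "http://localhost:8091/settings/autoFailover",
            auth=("Administrator", "password"),
            data={
                "enabled": "true",
                "timeout": 120,    # seconds of unresponsiveness (Timeout)
                "maxCount": 2,     # maximum auto-failovers before intervention
                # Assumed names for disk-related auto-failover and its time-period:
                "failoverOnDataDiskIssues[enabled]": "true",
                "failoverOnDataDiskIssues[timePeriod]": 120,
            },
        )
        resp.raise_for_status()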

      Auto-Failover During Rebalance

      If an auto-failover event occurs during a rebalance, the rebalance is stopped; then, auto-failover is triggered.

      Following an auto-failover, rebalance is not automatically re-attempted.

      At this point, the cluster is likely to be in an unbalanced state; therefore, a rebalance should be performed manually, and the unresponsive node fixed and restored to the cluster, as appropriate.

      Auto-Failover and Durability

      Couchbase Server provides durability, which ensures the greatest likelihood of data-writes surviving unexpected anomalies, such as node-outages. The auto-failover maximum should be established to support guarantees of durability.

      There are other settings that support guarantees of durability during auto-failover, as explained in Preserving Durable Writes. In Couchbase Enterprise Server 7.2+, auto-failover can be configured not to occur if a node’s failover might result in the loss of durably written data, as explained in Protection Guarantees and Automatic Failover.

      The Preserve Durable Writes setting is a global setting that applies to all buckets in the cluster — the setting can be enabled and disabled from the UI Settings page (see Node Availability) or using the REST API for Enabling and Disabling Auto-Failover settings.

      For complete information on durability and how the protection guarantees are affected by auto-failover, see Durability.