A newer version of this documentation is available.

View Latest

Automatic Failover

      A node or group can be failed over automatically, when it either becomes unavailable or experiences continuous disk-access problems.

      Understanding Automatic Failover

      Automatic Failover — or auto-failover — can be configured to fail over a node or group automatically: no immediate administrator intervention is required. Specifically, the Cluster Manager autonomously detects and verifies that the node or group is unavailable, and then initiates the hard failover process. Auto-failover does not fix or identify problems that may have occurred. Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required. Auto-failover is always hard failover. For information on how services are affected by hard failover, see Hard Failover.

      This page describes auto-failover concepts and architecture. For information on managing auto-failover, see the information provided for Couchbase Web Console at Node Availability, for the REST API at Managing Auto-Failover, and for the CLI at setting-autofailover.

      Failover Events

      Auto-failover occurs in response to failover events. Events are of three kinds:

      • Node failure. A server-node within the cluster is unavailable (due to a network failure, out-of-memory problem, or other node-specific issue).

      • Disk read/write failure. Attempts to read from or write to disk on a particular node have resulted in a significant rate of failure, for longer than a specified time-period. The node is removed by auto-failover, even though the node continues to be contactable.

      • Group failure. An administrator-defined group of server-nodes within the cluster is unavailable (perhaps due to a network or power failure that has affected an individual, physical rack of machines, or a specific subnet).

      Auto-Failover Constraints

      Couchbase Server applies the following constraints to auto-failover’s handling of events:

      • Auto-failover is triggered only on the occurrence of one event at a time. If multiple events occur concurrently, or if a second event follows the first prior to auto-failover’s completion, auto-failover is not triggered.

      • Auto-failover may be triggered sequentially up to an administrator-specified maximum number of events. The highest permitted maximum is 3. After this maximum number of auto-failovers has been triggered, no further auto-failover occurs, until the count is manually reset by the administrator.

      • Auto-failover is never triggered by Couchbase Server when data-loss might result: for example, when a bucket has no replicas. Therefore, even a single event may not be responded to; and an administrator-specified maximum number of events may not be reached.

      • Auto-failover is never triggered when the resulting number of available nodes within the cluster would not constitute a majority of the number available before the problem-occurrence. For example, if one of three nodes or groups were to go offline, auto-failover could be triggered by the cluster of two remaining; but if one of two nodes or groups were to go offline, auto-failover could not be triggered. (This prevents the situation where functioning nodes or groups, finding they cannot intercommunicate, fail each other over, and then attempt to continue independently of one another, each representing itself as the cluster.)

      Auto-failover can be triggered only when the cluster contains a minimum of three nodes; and should be configured for groups only when a minimum of three groups have been defined. Additionally, auto-failover should be configured only when the cluster contains sufficient resources to handle all possible results: workload-intensity on remaining nodes may increase significantly. Auto-failover is for intra-cluster use only: it does not work with Cross Data Center Replication (XDCR).

      Auto-failover may take significantly longer if the non-available node is that on which the orchestrator is running, since time-outs must occur before available nodes can elect a new orchestrator-node and thereby continue.

      See Email Alerts, for details on configuring email alerts related to failover.

      See Server Group Awareness, for information on server groups and group failover.

      Configuring Auto-Failover

      Auto-failover is configured by means of parameters that include the following.

      • Timeout. The number of seconds that must elapse, after a node or group has become unavailable, before auto-failover is triggered. This number is configurable: the default is 120 seconds; the minimum permitted is 5; the maximum 3600. Note that a low number reduces the potential time-period during which a consistently unavailable node remains unavailable before auto-failover is triggered; but may also result in auto-failover being unnecessarily triggered, in consequence of short, intermittent periods of node unavailability.

      • Maximum count. The maximum number of failover events that can occur sequentially and be handled by auto-failover. The maximum-allowed value is 3, the default is 1. This parameter is available in Enterprise Edition only: in Community Edition, the maximum number of failover events that can occur sequentially and be handled by auto-failover is always 1.

      • Count. The number of failover events that have occurred. The default value is 0. The value is incremented by 1 for every automatic-failover event that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0 through administrator-intervention.

      • Enablement of disk-related automatic failover; with corresponding time-period. Whether automatic failover is enabled to handle continuous read-write failures. If it is enabled, a number of seconds can also be specified: this is the length of a constantly recurring time-period against which failure-continuity on a particular node is evaluated. The default for this number of seconds is 120; the minimum permitted is 5; the maximum 3600. If at least 60% of the most recently elapsed instance of the time-period has consisted of continuous failure, failover is automatically triggered. The default value for the enablement of disk-related automatic failover is false. This parameter is available in Enterprise Edition only.

      • Group failover enablement. Whether or not groups should be failed over. A group failover is considered to be a single event, even if many nodes are included in the group. The default value is false. This parameter is available in Enterprise Edition only.

      By default, auto-failover is switched on, to occur after 120 seconds for up to 1 event. Nevertheless, Couchbase Server triggers auto-failover only within the constraints described above, in Auto-Failover Constraints. These include a minimum of three nodes being in the cluster.

      For more detailed information, see the documentation provided for specifying Node Availability with Couchbase Web Console UI, for Managing Auto-Failover with the REST API, and setting-autofailover with the CLI.