You are viewing the documentation for a prerelease version.

View Latest

Automatic Failover

A node or group can be failed over automatically when it either becomes unresponsive or experiences continuous disk-access problems.

Understanding Automatic Failover

Automatic Failover — or auto-failover — can be configured to fail over a node or group automatically: no immediate administrator intervention is required. Specifically, the Cluster Manager autonomously detects and verifies that the node or group is unresponsive, and then initiates the hard failover process. Auto-failover does not fix or identify problems that may have occurred. Once appropriate fixes have been applied to the cluster by the administrator, a rebalance is required. Auto-failover is always hard failover. For information on how services are affected by hard failover, see Hard Failover.

This page describes auto-failover concepts and policy. For information on managing auto-failover, see the information provided for Couchbase Web Console at General (which provides information on general cluster-settings), for the REST API at Managing Auto-Failover, and for the CLI at setting-autofailover.

Failover Events

Auto-failover occurs in response to failover events. Events are of three kinds:

  • Node failure. A server-node within the cluster is unresponsive (due to a network failure, out-of-memory problem, or other node-specific issue).

  • Disk read/write failure. Attempts to read from or write to disk on a particular node have resulted in a significant rate of failure, for longer than a specified time-period. The node is removed by auto-failover, even though the node continues to be contactable.

  • Group failure. An administrator-defined group of server-nodes within the cluster is unresponsive (perhaps due to a network or power failure that has affected an individual, physical rack of machines, or a specific subnet).

Auto-Failover Constraints

Auto-failover is triggered:

  • Only on the occurrence of one event at a time. If multiple events are concurrent, auto-failover is not triggered.

  • Sequentially, only up to an administrator-specified maximum number of events. The highest permitted maximum is 3. After this maximum number of auto-failovers has been reached, no further auto-failover occurs, until the count is manually reset by the administrator. Note, however, that the count can be manually reset prior to the maximum number being reached.

  • In no circumstances where data-loss might result: for example, when a bucket has no replicas. Therefore, even a single event may not be responded to; and an administrator-specified maximum number of events may not be reached.

  • Only in accordance with the Service-Specific Auto-Failover Policy for the service or services on the unresponsive node.

  • For nodes, only when a majority of nodes can still be contacted, following a node’s becoming unresponsive.

  • For groups, only when a majority of groups can still be contacted, following a group’s becoming unresponsive.

Note that auto-failover should be configured only when the cluster contains sufficient resources to handle all possible consequences: workload-intensity on remaining nodes may increase significantly.

Auto-failover is for intra-cluster use only: it does not work with Cross Data Center Replication (XDCR).

Auto-failover may take significantly longer if the unresponsive node is that on which the orchestrator is running; since time-outs must occur, before available nodes can elect a new orchestrator-node and thereby continue.

See Email Alerts, for details on configuring email alerts related to failover.

See Server Group Awareness, for information on server groups and group failover.

Service-Specific Auto-Failover Policy

The auto-failover policy for Couchbase Services is as follows:

  • A service must be running on a minimum number of nodes, for auto-failover to be applied to any one of those nodes, should that node become unresponsive.

  • The required minimum number of nodes is service-specific.

  • If the Data Service is running on its required minimum number of nodes, auto-failover may be applied to any of those nodes, even when auto-failover policy is thereby violated for one or more other, co-hosted services. This is referred to as Data Service Preference.

  • A node running the Index Service is never automatically failed over, unless Data Service Preference applies.

The node-minimum for each service is provided in the following table:

Service Nodes Required

Data

3

Query

2

Index

Not supported

Search

2

Analytics

2

Eventing

2

The figures in the Nodes Required column indicate the minimum number of nodes on which the corresponding service must be running for auto-failover to be triggered, unless Data Service Preference applies.

This is illustrated by the following examples:

  • A cluster has the following five nodes:

    Node Services Hosted

    #1

    Data

    #2

    Data

    #3

    Search & Query

    #4

    Search & Query

    #5

    Analytics & Query

    If node #4 becomes unresponsive, auto-failover can be triggered, since prior to unavailability, the Search Service was on two nodes (#3 and #4), and the Query Service on three (#3, #4, and #5): both of which figures meet the auto-failover policy requirement (for each of those services, 2).

    However, if instead, node #5 becomes unresponsive, auto-failover is not triggered; since prior to unavailability, the Analytics Service was running only on one node (#5), which is below the auto-failover policy requirement for the Analytics Service (which is 2).

  • A cluster has the following three nodes:

    Node Services Hosted

    #1

    Data, Query, & Search

    #2

    Data

    #3

    Data

    If node #1 becomes unresponsive, auto-failover can be triggered. This is due to Data Service Preference, which applies auto-failover based on the policy for the Data Service, irrespective of other services on the unresponsive node. In this case, even though the Query and Search Services were both running on only a single node (#1), which is below the auto-failover policy requirement for each of those services (2), the Data Service was running on three nodes (#1, #2, and #3), which meets the auto-failover policy requirement for the Data Service (3).

  • A cluster has the following four nodes:

    Node Services Hosted

    #1

    Data & Query

    #2

    Data, Index, & Query

    #3

    Data & Search

    #4

    Index

    If node #1, #2, or #3 becomes unresponsive, auto-failover can be triggered. In each case, this is due to Data Service Preference, which applies auto-failover based on the policy for the Data Service, irrespective of other services on the unresponsive node. Note that in the case of node #2, this allows an Index Service node to be automatically failed over. However, if node #4 becomes unresponsive, auto-failover is not triggered; since auto-failover is not supported for the Index Service, unless Data Service Preference applies.

Configuring Auto-Failover

Auto-failover is configured by means of parameters that include the following.

  • Timeout. The number of seconds that must elapse, after a node or group has become unresponsive, before auto-failover is triggered. This number is configurable: the default is 120 seconds; the minimum permitted is 5; the maximum 3600. Note that a low number reduces the potential time-period during which a consistently unresponsive node remains unresponsive before auto-failover is triggered; but may also result in auto-failover being unnecessarily triggered, in consequence of short, intermittent periods of node unavailability.

  • Maximum count. The maximum number of failover events that can occur sequentially and be handled by auto-failover. The maximum-allowed value is 3, the default is 1. This parameter is available in Enterprise Edition only: in Community Edition, the maximum number of failover events that can occur sequentially and be handled by auto-failover is always 1.

  • Count. The number of failover events that have occurred. The default value is 0. The value is incremented by 1 for every automatic-failover event that occurs, up to the defined maximum count: beyond this point, no further automatic failover can be triggered until the count is reset to 0 through administrator-intervention.

  • Enablement of disk-related automatic failover; with corresponding time-period. Whether automatic failover is enabled to handle continuous read-write failures. If it is enabled, a number of seconds can also be specified: this is the length of a constantly recurring time-period against which failure-continuity on a particular node is evaluated. The default for this number of seconds is 120; the minimum permitted is 5; the maximum 3600. If at least 60% of the most recently elapsed instance of the time-period has consisted of continuous failure, failover is automatically triggered. The default value for the enablement of disk-related automatic failover is false. This parameter is available in Enterprise Edition only.

  • Group failover enablement. Whether or not groups should be failed over. A group failover is considered to be a single event, even if many nodes are included in the group. The default value is false. This parameter is available in Enterprise Edition only.

By default, auto-failover is switched on, to occur after 120 seconds for up to 1 event. Nevertheless, Couchbase Server triggers auto-failover only within the constraints described above, in Auto-Failover Constraints.

For practical steps towards auto-failover configuration, see the documentation provided for specifying General settings with Couchbase Web Console UI, for Managing Auto-Failover with the REST API, and setting-autofailover with the CLI.

Auto-Failover During Rebalance

Couchbase Server provides a setting to determine whether, once enabled, auto-failover should specifically be triggered during Rebalance, in the event of a node becoming unresponsive.

If auto-failover has been set to be triggered, following the configured timeout period, the rebalance is stopped; then, auto-failover is duly triggered. Following auto-failover, rebalance is not automatically re-attempted. At this point, the cluster is likely to be in an unbalanced state: therefore, rebalance should be performed manually; and the unresponsive node fixed and restored to the cluster, as appropriate.

If auto-failover has not been set to be triggered, unless there is manual intervention, no failover occurs. This may cause the rebalance to hang for an indeterminate period; before failing, with error messages.

For information on setting auto-failover in the context of rebalance, see the information on General settings.

Auto-Failover and Durability

Couchbase Server provides durability, which ensures the greatest likelihood of data-writes surviving unexpected anomalies, such as node-outages. The auto-failover maximum should be established to support guarantees of durability. See Durability, for information.