Error handling

Errors are returned by the SDK if a specific operation could not be executed. Sometimes errors are received directly from the server while other times they are generated internally by the SDK.

Errors are divided into several categories based on their cause and the ways to handle them.

Data errors

Data errors are errors returned by the data store because a certain data condition was not met. Data errors typically have very clear corrective paths.

Item not found

If an item is not found, then it has either not yet been created or has since been deleted. This error is received on retrieval operations (get an item), replace operations (replace an item that already exists), and remove operations (delete an existing item).

If this error is received when attempting to retrieve an item, then the application should either create the item (if possible) or return an error to the user.

If this error is received when replacing an item, then it indicates an issue in the application state (perhaps you can raise an exception up the stack). If you do not care whether the item already exists, the upsert() method may be used instead.

If receiving this error when removing an item, it may safely be ignored: not-found on remove essentially means the item is already removed.
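The retrieval and removal cases above can be sketched as follows. This is a minimal illustration, not SDK code: NotFoundError and FakeBucket are hypothetical stand-ins for the SDK's not-found error and bucket object, and get_or_create and remove_if_present are helper names made up for this example.

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for the SDK's item-not-found error."""


class FakeBucket:
    """Tiny in-memory stand-in for a bucket, for illustration only."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        if key not in self._items:
            raise NotFoundError(key)
        return self._items[key]

    def upsert(self, key, value):
        self._items[key] = value

    def remove(self, key):
        if key not in self._items:
            raise NotFoundError(key)
        del self._items[key]


def get_or_create(bucket, key, default):
    """Retrieve an item, creating it with a default value if it is missing."""
    try:
        return bucket.get(key)
    except NotFoundError:
        bucket.upsert(key, default)
        return default


def remove_if_present(bucket, key):
    """Remove an item; not-found means it is already gone, so ignore it."""
    try:
        bucket.remove(key)
        return True
    except NotFoundError:
        return False
```

Whether a missing item should be created or reported to the user depends on the application; the default-value approach here is only one option.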

The Not Found error is returned by the server.

Item already exists

The insert operation requires that the item not yet exist; it is intended to create a new unique record (think about inserting a "new user ID"). This error is returned by the server when an item already exists. Applications at this point should probably return an error up the stack to the user (when applicable); for example, indicating that a new account could not be registered with the given user name because it already exists.
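The account registration case might look like the sketch below. KeyExistsError, FakeBucket, UsernameTaken, register_user, and the "user::" key prefix are all hypothetical names chosen for this illustration; substitute the already-exists error and bucket object of your SDK.

```python
class KeyExistsError(Exception):
    """Hypothetical stand-in for the SDK's item-already-exists error."""


class UsernameTaken(Exception):
    """Application-level error reported to the user."""


class FakeBucket:
    """Tiny in-memory stand-in for a bucket, for illustration only."""

    def __init__(self):
        self._items = {}

    def insert(self, key, value):
        # insert succeeds only if the key does not exist yet
        if key in self._items:
            raise KeyExistsError(key)
        self._items[key] = value


def register_user(bucket, username, profile):
    """Create a new account, translating the data error for the caller."""
    try:
        bucket.insert("user::" + username, profile)
    except KeyExistsError:
        raise UsernameTaken("user name %r is already registered" % username)
```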

CAS mismatch

A CAS mismatch error is returned when an operation was executed with a CAS value (supplied by the application) and that value differs from the CAS value on the server. The corrective course of action in this case is for the application to retry the read-update cycle as explained in detail in the CAS documentation.
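The read-update cycle can be sketched as a bounded retry loop. This is an illustration under assumptions: CasMismatchError and FakeBucket are hypothetical stand-ins, the stand-in get() returns a (value, cas) pair rather than an SDK result object, and update_with_cas is a helper name invented for this example.

```python
class CasMismatchError(Exception):
    """Hypothetical stand-in for the SDK's CAS mismatch error."""


class FakeBucket:
    """In-memory stand-in that assigns a new CAS on every write."""

    def __init__(self):
        self._items = {}  # key -> (value, cas)
        self._next_cas = 1

    def upsert(self, key, value):
        self._items[key] = (value, self._next_cas)
        self._next_cas += 1

    def get(self, key):
        return self._items[key]  # (value, cas)

    def replace(self, key, value, cas):
        _, current_cas = self._items[key]
        if cas != current_cas:
            raise CasMismatchError(key)
        self.upsert(key, value)


def update_with_cas(bucket, key, mutate, max_attempts=5):
    """Re-read and re-apply `mutate` until the CAS-guarded replace succeeds."""
    for _ in range(max_attempts):
        value, cas = bucket.get(key)
        try:
            bucket.replace(key, mutate(value), cas)
            return
        except CasMismatchError:
            continue  # another writer changed the item; read it again
    raise RuntimeError("gave up after %d CAS mismatches" % max_attempts)
```

The attempt limit keeps a heavily contended item from causing an endless loop; the value 5 is arbitrary.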

Transient and resource errors

These errors may be received because of resource starvation.

Temporary failure

This error is received during mutations when the cluster node has run out of memory. Its disk and replication queues are full, and the node must wait until items in those queues are stored and replicated before it can begin accepting new operations.

While this condition is rare, it may happen under massively concurrent writes from clients and a limited memory allocation on the server.

The short-term corrective action for this error is to throttle and slow down the application, giving Couchbase some time for pending operations to complete before issuing new operations. The long term corrective action is to increase memory capacity on the cluster, either by adding more RAM to each node, or by adding more nodes.
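The short-term throttling described above is often implemented as an exponential backoff. The sketch below assumes a hypothetical TemporaryFailureError in place of the SDK's temporary failure error; with_backoff and its parameters are illustrative names, and the default delay and attempt limit are arbitrary.

```python
import time


class TemporaryFailureError(Exception):
    """Hypothetical stand-in for the SDK's temporary failure error."""


def with_backoff(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run `operation`, doubling the pause after each temporary failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TemporaryFailureError:
            if attempt == max_attempts:
                raise  # still failing; let the caller decide what to do
            sleep(delay)  # give the server time to drain its queues
            delay *= 2
```

A mutation would be wrapped as, for example, with_backoff(lambda: cb.upsert("docid", value)), where cb is an already-connected bucket object.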

The temporary failure error is returned from the server.

Timeouts

While technically a transient error, discussion of timeouts and their handling warrants a special section.

Timeouts are returned by the client when an operation has waited too long for an acknowledgment from the server. The length of time a client waits for a response from the server is configurable (see your language’s Couchbase SDK reference for how to configure it).

Timeouts are caused by an unresponsive network or software system. Possible causes are:

  1. Congested network.

  2. Heavy CPU load on the server.

  3. Heavy CPU load on the client.

  4. Physical disruption between client and server.

Applications should generally not retry an operation if it resulted in a timeout before performing corrective action and/or analysis to determine the cause of these timeouts.

Timeout does not mean the operation failed.
It is important to note that receipt of a timeout error does not mean the operation did not complete. It means the server did not respond in a timely fashion. It is possible that the operation completed on the server, but the acknowledgment response was not received by the client. If the CAS of the item is known, you may use it to guard against performing the same operation twice.
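One way to use a known CAS as such a guard is sketched below, for cases where a retry has been judged appropriate. All names are hypothetical: OperationTimeoutError and CasMismatchError stand in for SDK errors, FlakyBucket is a single-item in-memory stand-in whose first replace is applied but loses its acknowledgment, and guarded_replace is a helper invented for this example.

```python
class OperationTimeoutError(Exception):
    """Hypothetical stand-in for the SDK's timeout error."""


class CasMismatchError(Exception):
    """Hypothetical stand-in for the SDK's CAS mismatch error."""


class FlakyBucket:
    """Single-item stand-in: the first replace is applied but its reply is lost."""

    def __init__(self, value):
        self.value = value
        self.cas = 1
        self.writes = 0
        self._drop_next_ack = True

    def replace(self, new_value, cas):
        if cas != self.cas:
            raise CasMismatchError()
        self.value = new_value
        self.cas += 1
        self.writes += 1
        if self._drop_next_ack:
            self._drop_next_ack = False
            raise OperationTimeoutError()  # applied, but the client never hears


def guarded_replace(bucket, new_value, cas, max_attempts=3):
    """Retry with the SAME CAS so the write cannot be applied twice."""
    for _ in range(max_attempts):
        try:
            bucket.replace(new_value, cas)
            return "applied"
        except OperationTimeoutError:
            continue  # outcome unknown; the original CAS still guards the retry
        except CasMismatchError:
            # The item changed since it was read: either the timed-out
            # attempt landed or another writer got in. Re-read to decide.
            return "changed"
    return "unknown"
```

In this sketch the retry ends with "changed" and the item has been written exactly once, because the first attempt had already replaced the CAS on the server side.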

Network errors

Network errors are returned by the client if it cannot establish a network connection to the server, or if an existing network connection was terminated. The cause of a network error may be included in the error object itself (for example, couldn’t resolve host name, connection refused, no route to host).

Like timeout errors, network errors may be transient (indicative of a bad network connection). They may also be a result of a node being failed over.

Network error does not indicate the failure of the operation
As with timeout errors, it is not possible to determine whether an operation actually completed if it resulted in a network error.

For cases where it is of the utmost importance to retrieve the item, a read from a replica can be performed. This will retrieve a potentially stale item.

Missing nodes

If a cluster is in a degraded state where one or more nodes are failed over, and no replicas remain to be promoted, then access to some items will be unavailable because there is no online cluster node which can offer access to them.

Operations which have failed with a Missing Node error will only succeed once the failed-over node is re-added to the cluster, or the cluster is rebalanced.

Preempting network errors and missing nodes

Some SDKs offer access to internal vBucket APIs which can be used by your application to determine the node that the SDK will contact to perform a given operation on a document. If your application contains an existing monitoring infrastructure, you may check to see if the node is detected as being unavailable, and take preemptive action avoiding potential timeouts.

Reading from replicas

High-availability applications can read documents from replicas, exchanging consistency for availability.

If your bucket is configured for replication, then multiple replicas of each item exist within the cluster. By default the client will attempt to access an item using its computed master or active node. This returns the current and authoritative version of the item as it is stored within Couchbase.

In conditions where the active node is unavailable (for example, it is disconnected from the network), an application may be able to access a replica version of the item using the getFromReplica API. This operation queries a replica node for a copy of the item. The item returned may be an older version: it is possible that a newer version exists on the active node but was not replicated before the active node went offline.

The getFromReplica operation is available in SDKs, either as a discrete API call or as an option to a get command.

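from couchbase.exceptions import CouchbaseNetworkError

# cb is assumed to be an already-connected bucket object
try:
    result = cb.get("docid")
except CouchbaseNetworkError as e:
    # The active node is unreachable; fall back to a possibly stale replica copy
    print("Got error. Fetching from replica!")
    result = cb.get("docid", replica=True)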