Handling Errors

    How to handle errors from the Python SDK.

    Errors are inevitable. The developer’s job is to be prepared for whatever is likely to come up, and to try to be prepared for anything that conceivably could come up. Couchbase gives you a lot of flexibility, but it is recommended that you equip yourself with an understanding of the possibilities.

    How the SDK Handles Errors

    Couchbase-specific exceptions are all derived from CouchbaseException. Errors that cannot be recovered by the SDK will be returned to the application. These unrecoverable errors are left to the application developer to handle — this section covers handling many of the common error scenarios.

    Handling Errors

    The approach will depend upon the type of error thrown. Is it transient? Is it even recoverable? Below we examine error handling strategies in relation to the Couchbase SDKs, then take a practical walk through some common error scenarios you are likely to have to handle when working with a Couchbase cluster.

    Failing

    While most of the time you want more sophisticated error handling strategies, sometimes you just need to fail. It makes no sense for some errors to be retried, either because they are not transient, or because you already tried everything to make it work and it still keeps failing. If the error cannot be contained and handled where it occurs, it needs to propagate up to a parent component that can handle it.

    For synchronous programs, every error is raised as an exception so that you can use regular try/except semantics.

    If you do not catch the exception, it will bubble up as an ordinary Python traceback, along these lines (illustrative output; the key, file, and exact message depend on your application and SDK version):

    Traceback (most recent call last):
      File "app.py", line 10, in <module>
        result = collection.get("does-not-exist")
    couchbase.exceptions.KeyNotFoundException

    Logging

    It is always important to log errors, but even more so in the case of reactive applications. Because of their event-driven nature, stack traces can be harder to read, and caller context is sometimes lost.
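    As a minimal sketch, the later examples in this section assume that standard-library logging has been configured somewhere near application start-up; the level and format shown here are purely illustrative:

    import logging

    # Configure the root logger once, early in the application, so that the
    # logging.info()/logging.error() calls in the snippets below are captured.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )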

    Retry

    Transient errors — such as those caused by resource starvation — are best tackled with one of the following retry strategies:

    * Retry immediately.
    * Retry with a fixed delay.
    * Retry with a linearly increasing delay.
    * Retry with an exponentially increasing delay (see the sketch after the example below).
    * Retry with a random delay.

    # This is an example of error handling for idempotent operations (such as the full-doc op seen here).

    import logging

    import couchbase.exceptions


    def change_email(collection,  # type: Collection
                     max_retries  # type: int
                     ):
        try:
            result = collection.get("doc_id")  # type: GetResult

            if not result:
                raise couchbase.exceptions.KeyNotFoundException()

            content = result.content
            content["email"] = "john.smith@couchbase.com"

            collection.replace("doc_id", content)
        except couchbase.exceptions.CouchbaseException as err:
            # is_retryable will be true for transient errors, such as a CAS mismatch (indicating
            # another agent concurrently modified the document), or a temporary failure (indicating
            # the server is temporarily unavailable or overloaded).  The operation may or may not
            # have been written, but since it is idempotent we can simply retry it.
            if err.is_retryable:
                if max_retries > 0:
                    logging.info("Retrying operation on retryable err %s", err)
                    change_email(collection, max_retries - 1)
                else:
                    # Errors can be transient but still exceed our SLA.
                    logging.error("Too many attempts, aborting on err %s", err)
                    raise
            else:
                # If the err is not retryable, there is perhaps a more permanent or serious error,
                # such as a network failure.
                logging.error("Aborting operation on err %s", err)
                raise


    MAX_RETRIES = 5

    try:
        change_email(collection, MAX_RETRIES)
    except couchbase.exceptions.CouchbaseException as err:
        # What to do here is highly application dependent.  Options could include:
        # - Returning a "please try again later" error back to the end-user (if any)
        # - Logging it for manual human review, and possible follow-up with the end-user (if any)
        logging.error("Failed to change email: %s", err)

    Fallback

    Instead of (or in addition to) retrying, another valid option is falling back to a different operation, or to a default value.
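    For example, a minimal sketch of falling back to a default value when a document is missing, reusing the imports from the example above (the key and the default shown here are purely illustrative):

    try:
        result = collection.get("user-prefs")
        prefs = result.content
    except couchbase.exceptions.KeyNotFoundException:
        # Fall back to an application-defined default instead of failing.
        prefs = {"theme": "default"}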

    KV

    The KV Service exposes several common errors that can be encountered, both during development and in the production app. Here we will cover some of the most common errors.

    Doc does not exist

    If a particular key cannot be found, a KeyNotFoundException is raised:

    
    try:
        collection.replace("my-key", {})
    except KeyNotFoundException:
        # key does not exist
        pass

    Doc already exists

    On the other hand, if the key already exists and should not (e.g. on an insert), then a KeyExistsException is raised:

    try:
        collection.insert("my-key",{})
    except KeyExistsException:
        # key already exists
        pass

    Doc too large

    If the document being stored exceeds the maximum value size accepted by the server (20 MB), a RequestTooBigException is raised.
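    A minimal sketch (the oversized value here is deliberately artificial):

    try:
        # A value larger than the server will accept.
        collection.upsert("my-key", {"data": "x" * (21 * 1024 * 1024)})
    except RequestTooBigException:
        # document is too large to be stored
        pass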

    CAS Mismatch

    Couchbase provides optimistic concurrency using CAS. Each document gets a CAS value on the server, which is changed on each mutation. When you get a document you automatically receive its CAS value, and when replacing the document, if you provide that CAS the server can check that the document has not been concurrently modified by another agent in-between. If it has, a CASMismatchException is raised, and the most appropriate response is usually to re-fetch the document and retry the operation:

    try:
        result = collection.get("my-key")
        collection.replace("my-key", {}, cas = result.cas)
    except couchbase.exceptions.CASMismatchException:
        # the CAS value has changed
        pass
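    In practice that usually means a small read-modify-write retry loop, along these lines (the key, the field being modified, and the attempt count are illustrative):

    for attempt in range(10):
        try:
            result = collection.get("my-key")
            content = result.content
            content["visits"] = content.get("visits", 0) + 1  # hypothetical modification
            collection.replace("my-key", content, cas=result.cas)
            break
        except couchbase.exceptions.CASMismatchException:
            # another agent changed the document first; re-fetch and try again
            continue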

    Durability ambiguous

    There are situations with any distributed system in which it is simply impossible to know for sure if the operation completed successfully or not. Take this as an example: your application requests that a new document be created on Couchbase Server. This completes, but, just before the server can notify the client that it was successful, a network switch dies and the application’s connection to the server is lost. The client will timeout waiting for a response and will raise a TimeoutException, but it’s ambiguous to the app whether the operation succeeded or not.

    So a TimeoutException is one ambiguous error; another is the DurabilitySyncWriteAmbiguousException, which can be raised when performing a durable operation. This similarly indicates that the operation may or may not have succeeded, though when using durability you are guaranteed that the operation will either have been applied to all replicas, or none.

    Given the inevitability of ambiguity, how is the application supposed to handle this?

    It really needs to be considered case-by-case, but the general strategy is to establish whether the operation succeeded or not, and to retry it if required.

    For instance, inserts can simply be retried to see if they fail with a KeyExistsException, in which case the original operation was successful (see the sketch after the example below):

    try:
        collection.upsert("my-key", {}, durability_level=Durability.PERSIST_TO_MAJORITY)
    except couchbase.exceptions.DurabilitySyncWriteAmbiguousException:
        # durable write request has not completed, it is unknown whether the request met the durability requirements or not
        pass
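    As a sketch of that pattern for an insert, reusing the illustrative key and durability level from above:

    try:
        collection.insert("my-key", {}, durability_level=Durability.PERSIST_TO_MAJORITY)
    except couchbase.exceptions.DurabilitySyncWriteAmbiguousException:
        # The write may or may not have been applied.  Since this is an insert, retry it:
        try:
            collection.insert("my-key", {}, durability_level=Durability.PERSIST_TO_MAJORITY)
        except couchbase.exceptions.KeyExistsException:
            # The key now exists, so the original insert did in fact succeed.
            pass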

    Durability invalid level

    try:
        collection.upsert("my-key", {}, durability_level=Durability.PERSIST_TO_MAJORITY)
    except couchbase.exceptions.DurabilityInvalidLevelException:
        # the requested durability level is invalid or not supported by this cluster
        pass

    No Cluster replicas configured

    try:
        collection.upsert("my-key", {}, persist_to=PersistTo.FOUR, ReplicateTo.THREE)
    except couchbase.exceptions.ReplicaNotConfiguredException:
        # cluster doesn't have replicas configured
        pass

    Replicate to / persist to greater than replica count

    try:
        collection.upsert("my-key", {}, persist_to=PersistTo.FOUR, replicate_to=ReplicateTo.THREE)
    except couchbase.exceptions.DurabilityImpossibleException:
        # cluster not able to meet durability requirements
        pass

    Timeout with replicate to / persist to requirements

    try:
        collection.upsert("my-key", {}, persist_to=PersistTo.FOUR, replicate_to=ReplicateTo.THREE)
    except TimeoutException:
        # document may or may not have persisted to specified durability requirements
        pass

    Query and Analytics Errors

    N1QL and Analytics queries either return results or an error. If there is an error, it surfaces as an exception raised by the SDK, which can be handled in the usual way.
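    A minimal sketch, assuming `cluster` is an already-connected Cluster object and using an illustrative statement:

    import logging

    from couchbase.exceptions import CouchbaseException

    try:
        result = cluster.query("SELECT * FROM `travel-sample` LIMIT 10")
        for row in result.rows():
            print(row)
    except CouchbaseException as err:
        # the query failed to parse, plan, or execute
        logging.error("Query failed: %s", err)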

    Search and View Errors

    Unlike N1QL and Analytics, Search and Views can return multiple errors, as well as errors alongside partial results.

    Connections…​

    Networks, remotely-located clusters, and XDCR all offer opportunities for packets to go astray, or resources to go offline or become temporarily unavailable.

    Authentication

    RBAC Roles - permissions on Service / Bucket / etc.

    Standard since Couchbase Data Platform 5.0, Role-Based Access Control (RBAC) gives fine-grained permissions designed to protect the security of and access to data, with a range of user roles encompassing different privileges. Refer to our Authorization pages for a fuller understanding.

    The developer must match an application’s need for data access with the necessary permissions for access.

    If you are using Couchbase Community Edition, the only roles available are Bucket Full Access, Admin, and Read-only Admin.

    Additional Resources

    Errors & Exception handling is an expansive topic. Here, we have covered examples of the kinds of exception scenarios that you are most likely to face. More fundamentally, you also need to weigh up concepts of durability.

    Diagnostic methods are available to check on the health of the cluster, and the health of the network.
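    For example, a quick sketch, assuming `cluster` is an already-connected Cluster object exposing the SDK's diagnostics report (see the Health Check documentation for details):

    # Fetch and log a point-in-time view of the SDK's connections to the cluster.
    diag = cluster.diagnostics()
    logging.info("Cluster diagnostics: %s", diag)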

    Logging methods are dependent upon the platform and SDK used. We offer recommendations and practical examples.