Error Handling

Handling transaction errors with Couchbase.

Couchbase transactions attempt to resolve many errors for you, by retrying individual operations and, where necessary, re-running the application's function literal. This covers some transient server errors and conflicts with other transactions.

Transaction Errors

There can be situations where total failure is indicated to the application via errors. These situations include:

Any error thrown by a transaction function literal, either deliberately or through an application logic bug.
Attempting to insert a document that already exists.
Calling ctx.Get() on a document key that does not exist (if the resultant error is not caught).

Once one of these errors occurs, the current attempt is irrevocably failed (though the transaction may retry the function literal to make a new attempt). It is not possible for the application to catch the failure and continue (with the exception of ctx.Get() raising an error). Once a failure has occurred, all other operations tried in this attempt (including commit) will instantly fail.
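
The one recoverable case noted above, a failed ctx.Get(), can be sketched as follows. This is a minimal illustration, not SDK code: fakeCtx, errDocNotFound and getOrCreate are hypothetical stand-ins so that the example runs on its own, and the assumption is that the real not-found error can be matched with errors.Is in the same way.

```go
package main

import (
	"errors"
	"fmt"
)

// errDocNotFound is a hypothetical stand-in for the SDK's not-found error.
var errDocNotFound = errors.New("document not found")

// fakeCtx is a hypothetical in-memory stand-in for the attempt context.
type fakeCtx struct {
	docs map[string]string
}

// Get returns the stored value, or an error wrapping errDocNotFound.
func (c *fakeCtx) Get(key string) (string, error) {
	v, ok := c.docs[key]
	if !ok {
		return "", fmt.Errorf("get %q: %w", key, errDocNotFound)
	}
	return v, nil
}

// Insert fails if the key already exists, mirroring the insert rule above.
func (c *fakeCtx) Insert(key, value string) error {
	if _, ok := c.docs[key]; ok {
		return fmt.Errorf("insert %q: document exists", key)
	}
	c.docs[key] = value
	return nil
}

// getOrCreate catches only the not-found error and continues the attempt.
// Any other error is returned unchanged so that the attempt fails.
func getOrCreate(c *fakeCtx, key, initial string) (string, error) {
	v, err := c.Get(key)
	if errors.Is(err, errDocNotFound) {
		if err := c.Insert(key, initial); err != nil {
			return "", err
		}
		return initial, nil
	}
	if err != nil {
		return "", err
	}
	return v, nil
}

func main() {
	c := &fakeCtx{docs: map[string]string{"a": "1"}}
	v, err := getOrCreate(c, "missing", "0")
	fmt.Println(v, err)
}
```

The point of the pattern is that only the specific not-found case is swallowed; every other error still propagates out of the function literal.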

Transactions, as they are multi-stage and multi-document, also have a concept of partial success or failure. This is signalled to the application through the TransactionResult.UnstagingComplete field, described later.

There are three errors that transactions can raise to an application:

TransactionFailedError
TransactionExpiredError
TransactionCommitAmbiguousError

TransactionFailedError and TransactionExpiredError

Both errors mean that the transaction definitely did not reach the commit point. TransactionFailedError indicates a fast failure, whereas TransactionExpiredError indicates that retries were made until the timeout was reached. This distinction is not normally important to an application, and TransactionExpiredError generally does not need to be handled separately.

Either way, an attempt will have been made to roll back all changes. This attempt may or may not have been successful, but its outcome has no impact on the protocol or on other actors. No changes from the transaction will be visible to either transactional or non-transactional actors.

Handling

Debugging exactly why a given transaction failed generally requires a review of the logs, so it is suggested that the application log them on failure. The application may want to try the transaction again later. Alternatively, if transaction completion time is not a priority, then transaction timeouts (which default to 15 seconds) can be extended across the board through TransactionsConfig.

cluster, err := gocb.Connect("localhost", gocb.ClusterOptions{
	TransactionsConfig: gocb.TransactionsConfig{
		// Applies to every transaction run through this cluster object.
		Timeout: 120 * time.Second,
	},
})
if err != nil {
	panic(err)
}

This will allow the protocol more time to get past any transient failures (for example, those caused by a cluster rebalance). The tradeoff to consider with longer timeouts is that documents staged by a transaction are effectively locked against modification by other transactions until the timeout has been reached.

Note that the timeout is not guaranteed to be followed precisely. For example, if the application were to do a long blocking operation inside the function literal (which should be avoided), then the timeout can only trigger after this finishes. Similarly, if the transaction attempts a key-value operation close to the timeout, and that key-value operation times out, then the transaction timeout may be exceeded.

TransactionCommitAmbiguousError

Each transaction has a 'single point of truth' that is updated atomically to reflect whether it is committed.

However, it is not always possible for the protocol to become 100% certain that the operation was successful, before the transaction expires. This potential ambiguity is unavoidable in any distributed system; a classic example is a network failure happening just after an operation was sent from a client to a server. The client will not get a response back and cannot know if the server received and executed the operation.

The ambiguity is particularly important at the point of the atomic commit, as the transaction may or may not have reached the commit point. Couchbase transactions will raise TransactionCommitAmbiguousError to indicate this state. It should be rare to receive this error.

If the transaction had in fact successfully reached the commit point, then the transaction will be fully completed ("unstaged") by the asynchronous cleanup process at some point in the future. With default settings this will usually be within a minute, but whatever underlying fault has caused the TransactionCommitAmbiguousError may lead to it taking longer.

If the transaction had not in fact reached the commit point, then the asynchronous cleanup process will instead attempt to roll it back at some point in the future.

Handling

This error can be challenging for an application to handle. As with TransactionFailedError, it is recommended that the application at least write any logs from the transaction, for future debugging. It may wish to retry the transaction at a later point, or extend transactional timeouts (as detailed above) to give the protocol additional time to resolve the ambiguity.

TransactionResult.UnstagingComplete

This boolean flag indicates whether all documents were able to be unstaged (committed).

For most use-cases it is not an issue if it is false. All transactional actors will still read all the changes from this transaction, as though it had committed fully. The cleanup process is asynchronously working to complete the commit, so that it will be fully visible to non-transactional actors.

The flag is provided for those rare use-cases where the application requires the commit to be fully visible to non-transactional actors, before it may continue. In this situation the application can raise an error here, or poll all documents involved until they reflect the mutations.

If you regularly see this flag set to false, consider increasing the transaction timeout to reduce the possibility that the transaction times out during the commit.

Full Error Handling Example

Pulling all of the above together, this is the suggested best practice for error handling:

result, err := cluster.Transactions().Run(func(ctx *gocb.TransactionAttemptContext) error {
	// ... transactional code here ...
	return nil
}, nil)
var ambigErr gocb.TransactionCommitAmbiguousError
if errors.As(err, &ambigErr) {
	fmt.Println("Transaction returned TransactionCommitAmbiguous and may have succeeded")

	// Of course, the application will want to use its own logging rather
	// than fmt.Printf
	fmt.Printf("%+v\n", ambigErr)
	return
}
var transactionFailedErr gocb.TransactionFailedError
if errors.As(err, &transactionFailedErr) {
	// The transaction definitely did not reach commit point
	fmt.Println("Transaction failed with TransactionFailed")
	fmt.Printf("%+v\n", transactionFailedErr)
	return
}
if err != nil {
	panic(err)
}

// The transaction definitely reached the commit point. Unstaging
// the individual documents may or may not have completed
if !result.UnstagingComplete {
	// In rare cases, the application may require the commit to have
	// completed.  (Recall that the asynchronous cleanup process is
	// still working to complete the commit.)
	// The next step is application-dependent.
}