Tracing from the C SDK (libcouchbase) with Couchbase Server

Use tracing from the SDK to discover timeouts and bottlenecks across the network and cluster.

For our example, we will customize the threshold logging tracer settings with lcb_cntl. By default the tracer logs every 10 seconds (if something was found) and samples only the top 10 slowest operations. The default threshold for key/value operations is 500 milliseconds. We will set these values so low that almost everything gets logged - not something you should do in production!

lcb_U32 val = 1;
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_KV, &val); /* 1 microsecond */
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL, &val); /* 1 second */

val = LCB_LOG_INFO; /* threshold logger writes at INFO level */
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_CONLOGGER_LEVEL, &val);

This sets our threshold for key/value operations to one microsecond and logs the operations it finds every second; the console logger level is set to INFO so that the tracer's output is actually visible.
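
By default only the ten slowest operations per interval are sampled. If you want more of them to show up, the sample queue size can likely be raised in the same way; the snippet below is a sketch that assumes your libcouchbase release exposes LCB_CNTL_TRACING_THRESHOLD_QUEUE_SIZE (check <libcouchbase/cntl.h> for your version):

/* Sketch: keep more slow-operation samples per flush interval.
 * LCB_CNTL_TRACING_THRESHOLD_QUEUE_SIZE is assumed to be available in
 * your libcouchbase release - verify it in <libcouchbase/cntl.h>. */
lcb_U32 queue_size = 100; /* keep up to 100 samples instead of the default 10 */
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_QUEUE_SIZE, &queue_size);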

With these configs in place we are ready to run some operations. Below, we read some documents from the travel-sample bucket and, if found, write them back with an upsert - giving us both read and write operations to log.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libcouchbase/couchbase.h>

static void die(lcb_t instance, const char *msg, lcb_error_t err)
{
    fprintf(stderr, "%s. Received code 0x%X (%s)\n", msg, err, lcb_strerror(instance, err));
    exit(EXIT_FAILURE);
}

static void op_callback(lcb_t instance, int cbtype, const lcb_RESPBASE *rb)
{
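    /* A successful GET: write the retrieved value back with a store so the
     * trace records a mutation as well. STORE callbacks fall through untouched. */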
    if (rb->rc == LCB_SUCCESS && cbtype == LCB_CALLBACK_GET) {
        lcb_error_t err;
        lcb_CMDSTORE cmd = {0};
        const lcb_RESPGET *rg = (const lcb_RESPGET *)rb;

        LCB_CMD_SET_KEY(&cmd, rg->key, rg->nkey);
        LCB_CMD_SET_VALUE(&cmd, rg->value, rg->nvalue);
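        /* LCB_SET stores the value unconditionally, which the tracer reports as an upsert */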
        cmd.operation = LCB_SET;

        err = lcb_store3(instance, NULL, &cmd);
        if (err != LCB_SUCCESS) {
            die(instance, "Couldn't schedule storage operation", err);
        }
    }
}

int main(int argc, char *argv[])
{
    lcb_error_t err;
    lcb_t instance;
    struct lcb_create_st create_options = {0};

    create_options.version = 3;
    create_options.v.v3.connstr = "couchbase://127.0.0.1/travel-sample";
    create_options.v.v3.passwd = "password";
    create_options.v.v3.username = "Administrator";

    err = lcb_create(&instance, &create_options);
    if (err != LCB_SUCCESS) {
        die(NULL, "Couldn't create couchbase handle", err);
    }

    {
        lcb_U32 val = 1;
        lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_KV, &val); /* 1 microsecond */
        lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL, &val); /* 1 second */

        val = LCB_LOG_INFO; /* threshold logger writes at INFO level */
        lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_CONLOGGER_LEVEL, &val);
    }

    err = lcb_connect(instance);
    if (err != LCB_SUCCESS) {
        die(instance, "Couldn't schedule connection", err);
    }

    lcb_wait(instance);

    err = lcb_get_bootstrap_status(instance);
    if (err != LCB_SUCCESS) {
        die(instance, "Couldn't bootstrap from cluster", err);
    }

    lcb_install_callback3(instance, LCB_CALLBACK_GET, op_callback);
    lcb_install_callback3(instance, LCB_CALLBACK_STORE, op_callback);

    {
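        /* Schedule GETs for airline_10 through airline_14; only airline_10
         * exists in travel-sample, so only one document is written back. */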
        lcb_CMDGET cmd = {0};
        char key[] = "airline_1X";
        size_t key_len = strlen(key);
        int i;
        for (i = 0; i < 5; i++) {
            key[key_len - 1] = i + '0';
            LCB_CMD_SET_KEY(&cmd, key, key_len);
            err = lcb_get3(instance, NULL, &cmd);
            if (err != LCB_SUCCESS) {
                die(instance, "Couldn't schedule retrieval operation", err);
            }
        }
    }
    lcb_wait(instance);

    lcb_destroy(instance);
    return 0;
}

Compile the program, link it against libcouchbase, and run it; you will see something like the following in the logs:

22ms [I3fc6ae73] {27957} [INFO] (tracer - L:148) Operations over threshold: {"count":6,"service":"kv","top":[{"last_local_address":"192.168.1.10:56762","last_local_id":"000000003fc6ae73/277953824e30913b","last_operation_id":"get:0x0","last_remote_address":"192.168.1.12:11210","server_us":27,"total_us":1761},{"last_local_address":"192.168.1.10:56762","last_local_id":"000000003fc6ae73/277953824e30913b","last_operation_id":"get:0x1","last_remote_address":"192.168.1.12:11210","server_us":8,"total_us":2189},{"last_local_address":"192.168.1.10:56762","last_local_id":"000000003fc6ae73/277953824e30913b","last_operation_id":"get:0x2","last_remote_address":"192.168.1.12:11210","server_us":3,"total_us":2356},{"last_local_address":"192.168.1.10:56762","last_local_id":"000000003fc6ae73/277953824e30913b","last_operation_id":"get:0x3","last_remote_address":"192.168.1.12:11210","server_us":3,"total_us":2477},{"last_local_address":"192.168.1.10:56762","last_local_id":"000000003fc6ae73/277953824e30913b","last_operation_id":"upsert:0x5","last_remote_address":"192.168.1.12:11210","server_us":100,"total_us":2489},{"last_local_address":"192.168.1.10:56762","last_local_id":"000000003fc6ae73/277953824e30913b","last_operation_id":"get:0x4","last_remote_address":"192.168.1.12:11210","server_us":3,"total_us":2592}]}

For each service (only key/value in this workload), the threshold log reporter shows the total number of recorded operations (count) and lists the slowest ones, sorted by latency. Since only airline_10 exists in the bucket, you will see five document fetches but only one mutation.
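
The same pattern applies to the other services. As a sketch - assuming your libcouchbase release exposes per-service threshold cntls such as LCB_CNTL_TRACING_THRESHOLD_N1QL (again, check <libcouchbase/cntl.h>) - a query threshold could be tuned like this:

/* Sketch: threshold for query (N1QL) operations, in the same units as the
 * KV threshold above. LCB_CNTL_TRACING_THRESHOLD_N1QL is assumed to exist
 * in your release - verify it in <libcouchbase/cntl.h>. */
lcb_U32 n1ql_threshold = 1000000; /* flag query operations slower than one second */
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_N1QL, &n1ql_threshold);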

Output fields in detail

Let’s highlight a single operation, and explain each field in a little more detail:

{
    "total_us": 1761,
    "server_us": 27,
    "last_remote_address": "192.168.1.12:11210",
    "last_operation_id": "get:0x0",
    "last_local_id": "000000003fc6ae73/277953824e30913b",
    "last_local_address": "192.168.1.10:56762"
}

This tells us the following:

  • total_us: The total time it took to perform the full operation - here, around 1.8 milliseconds.

  • server_us: The time the server reported spending on its part of the work - here, 27 microseconds. This does not include network time or the time the request waited in a buffer before being picked up on the cluster.

  • last_local_address: The local socket used for this operation.

  • last_remote_address: The remote socket on the server used for this operation. Useful to figure out which node is affected.

  • last_operation_id: A combination of the operation type and an id (here, the opaque value), useful for diagnosing and troubleshooting in combination with last_local_id.

  • last_local_id: With Server 5.5 and later, this id is negotiated with the server and can be used to correlate logging information on both sides in a simpler fashion.

With thresholds set appropriately for your production requirements, slow operations can be logged and pinpointed with very little effort - far more easily than before.
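
As a starting point, a more production-like configuration might move back toward the defaults mentioned at the beginning (a 500 millisecond key/value threshold, reported every 10 seconds), using the same cntls as in the example above; treat the exact values as something to tune against your own latency targets:

/* Sketch: thresholds closer to the documented defaults, using the same
 * cntls and units as the snippet at the top of this example
 * (KV threshold in microseconds, flush interval in seconds). */
lcb_U32 kv_threshold = 500000; /* 500 milliseconds */
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_KV, &kv_threshold);

lcb_U32 flush_interval = 10; /* report every 10 seconds */
lcb_cntl(instance, LCB_CNTL_SET, LCB_CNTL_TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL, &flush_interval);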