Troubleshooting and Best Practices

    What happens when more Workers are allocated for a Function?

    Couchbase Server limits the maximum number of workers for a specific Function to 64 (the default is 3 workers). This upper limit is set for system optimization purposes. You cannot create a Function with more than this upper bound.

    When deploying (or resuming a paused) Function, a threshold is dynamically calculated based on the node’s resources. If the number of workers exceeds this threshold, the system automatically generates a warning message but does not prevent the Function deployment. An example warning follows:

    There are 104 eventing workers configured to run on 24 cores. A sizing exercise is recommended.

    When should developers use the try-catch block in Function handlers?

    As a best practice, application developers should use try-catch blocks in the Function handler code for basic error handling and debugging operations.

    Before deployment, Couchbase Server verifies the Function handler code; only valid Functions get deployed. Using the log() option within a try-catch block, you can record errors. These error logs get stored in the Eventing function’s application log file. Note that the Eventing function’s application log file on disk is specific to the node that processed the mutation and is not global across the cluster. By default, JavaScript runtime errors get stored in the system logs. Unlike with system logs, troubleshooting and debugging operations are easy when you use try-catch blocks and the application log() option.

    During runtime, application logs, by default, do not capture any handler code exceptions. To log exceptions, encapsulate your code in a try-catch block.

    A sample try-catch block is provided for reference:

    function OnUpdate(doc, meta) {
        log('document', doc);
        try {
            // random_gen() is a helper assumed to be defined elsewhere in the
            // handler; it returns a random suffix used to build a unique key.
            var time_rand = random_gen();
            // Write the document to the destination bucket via its alias.
            dst_bucket[meta.id + time_rand] = doc;
        } catch (e) {
            // Any runtime error is captured here and written to the application log.
            log(e);
        }
    }

    What are bucket alias considerations during a Function definition?

    Function handlers can trigger data mutations. To avoid a cyclic generation of data changes, carefully consider the following aspects when specifying source and destination buckets:

    • Avoid infinite recursions. If you are using a series of handlers, ensure that the destination buckets to which event handlers write do not have other Function handlers configured that can create a loop by triggering cyclic mutations. For example, the following design demonstrates an infinite recursion:

      functionA with source bucketA target bucketB aliased as same.

      function OnUpdate(doc, meta) {
          bucketB[meta.id] = {"status":"updated by functionA"};
      }

      functionB with source bucketB target bucketC aliased as same.

      function OnUpdate(doc, meta) {
          bucketC[meta.id] = {"status":"updated by functionB"};
      }

      functionC with source bucketC target bucketA aliased as same.

      function OnUpdate(doc, meta) {
          bucketA[meta.id] = {"status":"updated by functionC"};
      }

      In the example above, a single mutation in "bucketA" creates an infinite loop: a record is updated in "bucketB", then in "bucketC", then back in "bucketA", over and over.

      One possible solution is to change the design above so that functionC updates bucketD (instead of bucketA); with this change there is no recursion, as follows:

      functionC (modified to write to a different bucket) with source bucketC target bucketD aliased as same.

      function OnUpdate(doc, meta) {
          bucketD[meta.id] = {"status":"updated by functionC"};
      }

      Another possible solution is to change the design so that functionA checks whether functionC has already operated on the document and, if so, performs no new mutation, as follows:

      functionA (modified to stop recursion) with source bucketA target bucketB aliased as same.

      function OnUpdate(doc, meta) {
          if (doc["status"] == "updated by functionC") return;
          bucketB[meta.id] = {"status":"updated by functionA"};
      }
    • Although Couchbase Server can flag simple infinite recursions, a long chain of source and destination buckets with a series of handlers can still produce a complex infinite recursion condition. The developer must carefully consider and avoid these cases.

    • As a best practice, ensure that buckets to which the Function handler performs a write operation do not have other handlers configured for tracking data mutations.

    There is a special case of direct self-recursion, which is highly useful: when a handler creates a Read-Write binding to its own source bucket, you can perform document enrichment operations. In this case, the direct self-recursive mutations are detected and suppressed by the Eventing framework. However, this capability is only supported for the aliased JavaScript map and is not supported for mutations generated via N1QL.

    In the 6.5 release, the handler code can directly mutate (or write back to) the source bucket, that is, perform direct self-recursion.
    • For example, the following design is taken from Data Enrichment, Case 2:

      functionDirectEnrich with source bucketA target bucketA aliased as 'src'

      function OnUpdate(doc, meta) {
        log('document', doc);
        doc["ip_num_start"] = get_numip_first_3_octets(doc["ip_start"]);
        doc["ip_num_end"]   = get_numip_first_3_octets(doc["ip_end"]);
        // !!! write back to the source bucket !!!
        src[meta.id]=doc;
      }
      function get_numip_first_3_octets(ip) {
        var return_val = 0;
        if (ip) {
          var parts = ip.split('.');
          // IP Number = A x (256*256*256) + B x (256*256) + C x 256 + D
          return_val = (parts[0]*(256*256*256)) + (parts[1]*(256*256)) + (parts[2]*256) + parseInt(parts[3]);
        }
        return return_val;
      }
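
      For example, a hypothetical ip_start of "10.1.2.3" would convert to 10 x 16777216 + 1 x 65536 + 2 x 256 + 3 = 167838211.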

    In the cluster, I notice a sharp increase in the Timeout Statistics. What are my next steps?

    When the Timeout Statistics show a sharp increase, the cause is typically one of two scenarios:

    • Increase in execution time: When the handler execution time increases, the Function execution latency is affected and this, in turn, leads to Function backlog and failure conditions.

    • Script timeout value: When the script timeout attribute value is not correctly configured, you encounter timeout conditions frequently.

    As a workaround, it is recommended to increase the script timeout value. Ensure that you configure the script timeout value after carefully evaluating the execution latency of the Function.

    As a best practice, use a combination of try-catch blocks and the application log() option. This way, you can monitor, debug, and troubleshoot errors during Function execution.
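
    A minimal sketch of this pattern follows. It is illustrative only: do_enrichment() is a hypothetical placeholder for your actual handler logic, and the timing log simply helps you evaluate execution latency before tuning the script timeout value.

    function OnUpdate(doc, meta) {
        var start = Date.now();
        try {
            // do_enrichment() is a hypothetical helper standing in for the
            // real work performed by this handler.
            do_enrichment(doc, meta);
        } catch (e) {
            // Captured exceptions are written to the application log of the
            // node that processed the mutation.
            log('OnUpdate failed for', meta.id, e);
        }
        // Record the elapsed time so execution latency can be evaluated
        // before adjusting the script timeout setting.
        log('OnUpdate took', Date.now() - start, 'ms for', meta.id);
    }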

    Why is it important that the metadata bucket be 100% memory resident?

    If the bucket you chose to hold your metadata spills over to disk, that is, it is not 100% resident, your Eventing system can essentially stall or slow down by orders of magnitude, and you can also experience failures or missed mutations.

    Always make sure that the memory quota on your metadata bucket is sufficiently large to ensure a residency ratio of 100%.

    Eventing worked fine when the application was first deployed, but now I am getting ETMPFAIL failures.

    A low residency ratio for either the source or the destination bucket (sometimes these two can be the same) can result in a system that’s unable to keep up with the rate of mutations and with the internal logic’s required reads and writes to the data service.

    Watch the number of documents in your buckets (source, metadata, and destination) and, in particular, pay close attention to changes in the residency ratio. Typically, a drop could be due to growth in your overall data set.

    For example, a high velocity Eventing function that is processing in excess of 12K mutations/sec with a source or destination bucket residency ratio of 100% can easily start to experience issues if the residency ratio drops below 18% (this percentage isn’t hard and fast and may vary based on a variety of factors such as the number of mutations acted on, the storage type, and so on).

    2020-03-13T11:46:32.383-07:00 [INFO] "Exception: " {"message":{"code":392,"desc": \
    "Temporary failure received from server. Try again later","name":"LCB_ETMPFAIL"}, \
    "stack":"Error\n    at OnUpdate (MyEventingFunction.js:177:25)"}

    The above error indicates that the system is under-provisioned for the load. Under the hood, Eventing tries to access the data store five (5) times with a 200ms pause between attempts. If all of the attempts fail, the handler, in this case MyEventingFunction, throws an ETMPFAIL error. This is important to understand because trapping the above exception and retrying the same operation inside your handler will only exacerbate the issue and make things worse. Of course, your handler can take other actions, such as creating a notification.
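
    The sketch below illustrates this guidance: the exception is trapped and logged (or handed off to your own notification mechanism) rather than retried in place. The dst_bucket alias is assumed to be the handler’s destination bucket binding.

    function OnUpdate(doc, meta) {
        try {
            dst_bucket[meta.id] = doc;
        } catch (e) {
            // Do NOT retry the same write here; under sustained pressure a
            // retry loop only adds more load. Log the failure (or raise a
            // notification through your own channel) and move on.
            log('Write failed for', meta.id, e);
        }
    }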

    There are two solutions:

    1. The first solution is to increase the memory quota of the bucket in question (thus increasing the residency ratio).

    2. The second solution is to add more Data nodes, faster disk IO, and more memory to eliminate the resource bottleneck.