Slow Operations Logging

    Tracing information on slow operations can be found in the logs as threshold logging, orphan logging, and other span metrics. Change the settings to alter how much information you collect.

    To improve debuggability, certain metrics are automatically measured and logged. These include slow queries, responses taking longer than a certain threshold, and orphaned responses.

    Observability Metrics

    Individual request traces present a very specific (and isolated) view of the system. In addition, it makes sense to capture information that aggregates request data (e.g. requests per second), as well as data which is not tied to a specific request at all (e.g. resource utilization).

    The deployment situation itself is similar to the tracer: either applications already have a metrics infrastructure in place or they don’t. The difference is that exposing some kind of metrics is much more common than request-based tracing, because most production deployments at least monitor CPU usage and the like.

    Metrics fall into the following categories:

    • Request/Response Metrics (such as requests per second).

    • SDK Metrics (such as how many open collections, various queue lengths).

    • System Metrics (such as CPU usage, garbage collection performance).

    Configuring Metrics Logging

    The information logged can be configured, although configurability does vary slightly between SDKs.

    Advanced users can build their own implementations of the interface:

    import java.util.Map;

    interface Meter {
        Counter counter(String name, Map<String, String> tags);
        ValueRecorder valueRecorder(String name, Map<String, String> tags);
    }

    interface Counter {
        // ulong in the original definition; Java has no unsigned long, so long is used here
        void incrementBy(long number);
    }

    interface ValueRecorder {
        void recordValue(long value);
    }
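
    As a minimal sketch of a custom implementation (not tied to any particular SDK), the interfaces above could be backed by simple in-memory atomics. Only the interface shapes come from this document; the class name and the keying scheme below are illustrative assumptions, and a real implementation would also need to track percentiles for the report shown below.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    public class InMemoryMeter implements Meter {

        // One instrument per (name, tags) combination.
        private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
        private final Map<String, LongAdder> recorders = new ConcurrentHashMap<>();

        @Override
        public Counter counter(String name, Map<String, String> tags) {
            LongAdder adder = counters.computeIfAbsent(name + tags, k -> new LongAdder());
            return adder::add; // Counter#incrementBy maps directly onto LongAdder#add
        }

        @Override
        public ValueRecorder valueRecorder(String name, Map<String, String> tags) {
            LongAdder sum = recorders.computeIfAbsent(name + tags, k -> new LongAdder());
            return sum::add; // only sums the values; a histogram would be needed for percentiles
        }
    }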

    At each emit/log interval, the AggregatingMeter outputs a JSON structure very similar to that of the ThresholdRequestTracer or the OrphanResponseReporter. The default value for the emitInterval is 600 seconds (10 minutes).
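
    To make the emit behaviour concrete, the hypothetical meter sketched above could snapshot its counters on a scheduler and write one log line per interval. This is a simplified sketch that only emits counter totals (the real report shown below also contains per-node percentiles); the scheduling code is an assumption, only the 600-second default comes from this document.

    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;
    import java.util.stream.Collectors;

    public class EmitLoop {

        private static final long EMIT_INTERVAL_S = 600; // default emitInterval: 10 minutes

        // Logs a snapshot of the counters every emit interval and resets them.
        static ScheduledExecutorService startEmitting(Map<String, LongAdder> counters) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                String body = counters.entrySet().stream()
                    .map(e -> "\"" + e.getKey() + "\": " + e.getValue().sumThenReset())
                    .collect(Collectors.joining(", "));
                System.out.println("{ \"meta\": { \"emit_interval_s\": " + EMIT_INTERVAL_S + " }, " + body + " }");
            }, EMIT_INTERVAL_S, EMIT_INTERVAL_S, TimeUnit.SECONDS);
            return scheduler;
        }
    }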

    JSON Output Format & Logging

    The overall structure looks like this (here prettified for readability):

    {
      "meta": {
        "emit_interval_s": 600
      },
      "<service-a>": {
        "<node-a>": {
          "total_count": 1234,
          "percentiles_us": {
            "50.0": 5,
            "90.0": 10,
            "99.0": 33,
            "99.9": 55,
            "100.0": 101
          }
        }
      }
    }

    For each service and each node, the total count and the latency percentiles are reported. This helps during debugging to get a decent idea of the latency distribution across services and nodes. For more sophisticated grouping and aggregations, users should use the forthcoming OpenTelemetryMeter or a custom implementation.

    The emit_interval_s is reported in the meta section of the JSON output because calculating the ops/s requires dividing the total_count by the emit_interval_s. The configuration property is not always at hand when debugging logs, so it is included in the output to keep that calculation simple.
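
    For example, with the values from the sample report above (a total_count of 1234 over the default 600-second interval), the throughput works out to roughly 2 ops/s; the class below is only a worked illustration of that division.

    public class OpsPerSecond {
        public static void main(String[] args) {
            long totalCount = 1234;   // per-node "total_count" from the report
            long emitIntervalS = 600; // "emit_interval_s" from the "meta" section
            double opsPerSecond = totalCount / (double) emitIntervalS;
            System.out.printf("%.2f ops/s%n", opsPerSecond); // prints 2.06 ops/s
        }
    }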

    Threshold Logging Reporting

    Threshold logging is the recording of slow operations — useful for diagnosing when and where problems occur in a distributed environment.

    Configuring Threshold Logging

    To configure threshold logging, adjust the ThresholdRequestTracer. You should expect to see output in JSON format in the logs for the services encountering problems:

    {
      "<service-a>": {
        "total_count": 1234,
        "top_requests": [{<entry>}, {<entry>},...]
      },
      "<service-b>": {
        "total_count": 1234,
        "top_requests": [{<entry>}, {<entry>},...]
      }
    }

    The total_count represents the total number of over-threshold items recorded in each interval, per service. The number of entries in "top_requests" is configured by the sampleSize. The service placeholder is replaced with each service, such as "kv" or "query". Each entry looks like this, with all fields populated:

    {
      "total_duration_us": 1200,
      "encode_duration_us": 100,
      "last_dispatch_duration_us": 40,
      "total_dispatch_duration_us": 40,
      "last_server_duration_us": 2,
      "operation_name": "upsert",
      "last_local_id": "66388CF5BFCF7522/18CC8791579B567C,
      "operation_id": "0x23",
      "last_local_socket": "10.211.55.3:52450",
      "last_remote_socket": "10.112.180.101:11210"
    }

    If a field is not present (because, for example, dispatch did not happen), it will not be included.
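
    As a rough sketch of how these fields can be read during an investigation, the snippet below attributes the total duration to encoding, dispatch, and the remaining client-side time. It assumes the Jackson library is available and that the entry has already been extracted from the log; the attribution in the comments is a heuristic, not a definition from this document.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ThresholdEntryInspector {

        public static void main(String[] args) throws Exception {
            String entry = "{\"total_duration_us\":1200,\"encode_duration_us\":100,"
                + "\"total_dispatch_duration_us\":40,\"last_server_duration_us\":2,"
                + "\"operation_name\":\"upsert\",\"last_remote_socket\":\"10.112.180.101:11210\"}";

            JsonNode node = new ObjectMapper().readTree(entry);
            long total = node.path("total_duration_us").asLong();
            long encode = node.path("encode_duration_us").asLong();           // time spent encoding the request
            long dispatch = node.path("total_dispatch_duration_us").asLong(); // time spent dispatched, across all attempts
            long server = node.path("last_server_duration_us").asLong();      // server-reported processing time

            // Whatever is left over is roughly client-side time (queueing, scheduling, decoding).
            long clientSide = total - encode - dispatch;
            System.out.printf("%s against %s: server %dus, dispatch %dus (of which ~%dus network), encode %dus, client-side ~%dus%n",
                node.path("operation_name").asText(), node.path("last_remote_socket").asText(),
                server, dispatch, dispatch - server, encode, clientSide);
        }
    }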

    Orphaned Response Reporting

    Orphan response reporting acts as an auxiliary tool to the tracing and metrics capabilities. It does not expose an external API to the application and has a deliberately narrow feature set.

    It works as follows: whenever a response is being completed and the SDK detects that the original caller is no longer listening (most likely because of a timeout), it sends this “orphan” response to a reporting utility, which aggregates the responses and logs them at regular intervals in a specific format.

    When the user then sees timeouts in their logs, they can inspect the output of the orphan reporter and correlate certain properties to aid debugging in production. For example, if a single node is slow but the rest of the cluster is responsive, this would be visible in the orphan report.

    Configuring Orphan Logging

    The OrphanResponseReporter is very similar in principle to the ThresholdRequestTracer, but instead of tracking responses which are over a specific threshold it tracks those responses which are “orphaned”.

    The emitInterval and sampleSize can be adjusted (defaults are 10s and 10 samples per service, respectively). The overall structure looks like this (here prettified for readability):

    {
      "<service-a>": {
        "total_count": 1234,
        "top_requests": [{<entry>}, {<entry>},...]
      },
      "<service-b>": {
        "total_count": 1234,
        "top_requests": [{<entry>}, {<entry>},...]
      }
    }

    The total_count represents the total number of orphaned items recorded in each interval, per service. The number of entries in "top_requests" is configured by the sampleSize. The service placeholder is replaced with each service, such as "kv" or "query". Each entry looks like this, with all fields populated:

    {
      "total_duration_us": 1200,
      "encode_duration_us": 100,
      "last_dispatch_duration_us": 40,
      "total_dispatch_duration_us": 40,
      "last_server_duration_us": 2,
      "timeout_ms": 75000,
      "operation_name": "upsert",
      "last_local_id": "66388CF5BFCF7522/18CC8791579B567C,
      "operation_id": "0x23",
      "last_local_socket": "10.211.55.3:52450",
      "last_remote_socket": "10.112.180.101:11210"
    }

    If a field is not present (because, for example, dispatch did not happen), it will not be included.
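
    To illustrate the node-level correlation mentioned earlier (a single slow node standing out), one could tally orphaned entries per remote socket. This is a small sketch, assuming the Jackson library and that the individual entry objects have already been extracted from the log output:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class OrphansByNode {

        // Counts how many orphaned responses each node produced, to spot an outlier.
        static Map<String, Integer> countByRemoteSocket(List<String> entryJsonLines) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            Map<String, Integer> perNode = new HashMap<>();
            for (String line : entryJsonLines) {
                JsonNode entry = mapper.readTree(line);
                String remote = entry.path("last_remote_socket").asText("unknown");
                perNode.merge(remote, 1, Integer::sum);
            }
            return perNode; // e.g. {10.112.180.101:11210=9, 10.112.180.102:11210=1}
        }
    }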