Monitoring

Sync Gateway 2.5 includes a number of new metrics available through the Admin REST API (/_expvars). Those new metrics cover performance, resource utilization and health checks of the Sync Gateway nodes which are increasingly important as deployments scale to support a large number of connected Couchbase Lite clients.

The following curl command requests the monitoring stats on the Admin REST API.

curl -X GET "http://localhost:4985/_expvar" -H "accept: application/json"

The response is a JSON object with the following schema.

{
  "syncgateway": {
    "global": {
      "resource_utilization": {
        "admin_net_bytes_recv": 0,
        "admin_net_bytes_sent": 0,
        "error_count": 0,
        "go_memstats_heapalloc": 0,
        "go_memstats_heapidle": 0,
        "go_memstats_heapinuse": 0,
        "go_memstats_heapreleased": 0,
        "go_memstats_pausetotalns": 0,
        "go_memstats_stackinuse": 0,
        "go_memstats_stacksys": 0,
        "go_memstats_sys": 0,
        "goroutines_high_watermark": 0,
        "num_goroutines": 0,
        "process_cpu_percent_utilization": 0,
        "process_memory_resident": 0,
        "pub_net_bytes_recv": 0,
        "pub_net_bytes_sent": 0,
        "system_memory_total": 0,
        "warn_count": 0
      }
    },
    "per_db": {
      "$dbname": {
        "cache": {
          "chan_cache_active_revs": 0,
          "chan_cache_hits": 0,
          "chan_cache_max_entries": 0,
          "chan_cache_misses": 0,
          "chan_cache_num_channels": 0,
          "chan_cache_removal_revs": 0,
          "chan_cache_tombstone_revs": 0,
          "num_skipped_seqs": 0,
          "rev_cache_hits": 0,
          "rev_cache_misses": 0
        },
        "cbl_replication_pull": {
          "attachment_pull_bytes": 0,
          "attachment_pull_count": 0,
          "dcp_caching_count": 0,
          "dcp_caching_time": 0,
          "max_pending": 0,
          "num_pull_repl_active_continuous": 0,
          "num_pull_repl_active_one_shot": 0,
          "num_pull_repl_caught_up": 0,
          "num_pull_repl_since_zero": 0,
          "num_pull_repl_total_continuous": 0,
          "num_pull_repl_total_one_shot": 0,
          "request_changes_count": 0,
          "request_changes_time": 0,
          "rev_processing_time": 0,
          "rev_send_count": 0,
          "rev_send_latency": 0
        },
        "database": {
          "crc32c_match_count": 0,
          "dcp_caching_count": 0,
          "dcp_caching_time": 0,
          "dcp_received_count": 0,
          "dcp_received_time": 0,
          "doc_reads_bytes_blip": 0,
          "doc_writes_bytes": 0,
          "doc_writes_bytes_blip": 0,
          "num_doc_reads_blip": 0,
          "num_doc_reads_rest": 0,
          "num_doc_writes": 0,
          "num_replications_active": 0,
          "num_replications_total": 0,
          "sequence_get_count": 0,
          "sequence_released_count": 0,
          "sequence_reserved_count": 0,
          "warn_channels_per_doc_count": 0,
          "warn_grants_per_doc_count": 0,
          "warn_xattr_size_count": 0
        },
        "security": {
          "auth_failed_count": 0,
          "auth_success_count": 0,
          "num_access_errors": 0,
          "num_docs_rejected": 0,
          "total_auth_time": 0
        }
      }
    },
    "per_replication": {
      "$replname": {
        "sgr_active": true,
        "sgr_docs_checked_sent": 0,
        "sgr_num_attachments_transferred": 0,
        "sgr_num_attachment_bytes_transferred": 0,
        "sgr_num_docs_failed_to_push": 0,
        "sgr_num_docs_pushed": 0
      }
    }
  }
}

syncgateway

Monitoring stats

global

Global Sync Gateway stats

resource_utilization

Resource utilization stats

admin_net_bytes_recv

Total number of bytes received on the network interface that the Sync Gateway admin interface is bound to, since start-up.

By default, that is the number of bytes received on 127.0.0.1 since start-up.

Throughput on admin interface = admin_net_bytes_recv / admin_net_bytes_sent

admin_net_bytes_sent

Total number of bytes sent on the network interface that the Sync Gateway admin interface is bound to, since start-up.

By default, that is the number of bytes sent on 127.0.0.1 since start-up.

Throughput on admin interface = admin_net_bytes_recv / admin_net_bytes_sent

error_count

Number of errors logged.

go_memstats_heapalloc

Go memstats.HeapAlloc

go_memstats_heapidle

Go memstats.HeapIdle

go_memstats_heapinuse

Go memstats.HeapInuse

go_memstats_heapreleased

Go memstats.HeapReleased

go_memstats_pausetotalns

Go memstats.PauseTotalNs

go_memstats_stackinuse

Go memstats.StackInuse

go_memstats_stacksys

Go memstats.StackSys

go_memstats_sys

Go memstats.Sys

goroutines_high_watermark

Peak number of go routines since process start.

num_goroutines

Number of goroutines.

process_cpu_percent_utilization

CPU utilization (%).

The CPU usage calculation is performed based on user and system CPU time and doesn’t include components such as iowait. Therefore, process_cpu_percent_utilization differs from the %Cpu value returned when running the top command.

process_memory_resident

Memory utilization (Resident Set Size) for the process in bytes.

pub_net_bytes_recv

Total number of bytes received on the network interface that the Sync Gateway public interface is bound to, since start-up.

By default, that is the number of bytes received on 0.0.0.0 since start-up.

Throughput on public interface = pub_net_bytes_recv / pub_net_bytes_sent

pub_net_bytes_sent

Total number of bytes sent on the network interface that the Sync Gateway public interface is bound to, since start-up.

By default, that is the number of bytes sent on 0.0.0.0 since start-up.

Throughput on public interface = pub_net_bytes_recv / pub_net_bytes_sent

system_memory_total

Total memory available on the system in bytes.

warn_count

Number of warnings logged.

per_db

Stats for each database declared in the config file

$dbname

Stats relative to a database declared in the config file.

cache

Stats relative to caching

abandoned_seqs

The number of skipped sequences that were not found after 60 minutes and were abandoned.

chan_cache_active_revs

The number of active revisions in the channel cache.

chan_cache_hits

Channel cache requests fully served by the cache.

Channel Cache Hit Ratio = chan_cache_hits / (chan_cache_hits + chan_cache_misses)

chan_cache_max_entries

Size of the largest channel cache.

Helps with channel cache tuning, and as a hint on cache size variation (when compared to average cache size).

chan_cache_misses

Channel cache requests not fully served by the cache.

Channel Cache Hit Ratio = chan_cache_hits / (chan_cache_hits + chan_cache_misses)

chan_cache_num_channels

Number of channels being cached.

Insight into total number of channels being cached - provides insight into potential max cache size (num channels * max_cache_size), as well as node usage.

chan_cache_removal_revs

The number of removal revisions in the channel cache.

Acts as a reminder that removals must be considered when tuning the channel cache size. Also helps users understand whether they should be tuning tombstone retention policy (metadata purge interval), and running compact.

chan_cache_tombstone_revs

The number of tombstone revisions in the channel cache.

Acts as a reminder that tombstones and removals must be considered when tuning the channel cache size. Also helps users understand whether they should be tuning tombstone retention policy (metadata purge interval), and running compact.

num_skipped_seqs

Number of skipped sequences.

Helps with channel cache tuning, and as a hint on cache size variation (when compared to average cache size).

rev_cache_hits

Revision cache hits.

Rev Cache Hit Ratio = rev_cache_hits / (rev_cache_hits + rev_cache_misses)

rev_cache_misses

Revision cache misses.

Rev Cache Hit Ratio = rev_cache_hits / (rev_cache_hits + rev_cache_misses)

cbl_replication_pull

attachment_pull_bytes

Average size of attachments pulled. This is the pre-compressed size.

attachment_pull_count

Number of attachments pulled.

dcp_caching_count

This metric can be used to calculate the time between seeing a change on the DCP feed and when it’s available in the channel cache.

DCP cache latency = dcp_caching_time / dcp_caching_count

dcp_caching_time

This metric can be used to calculate the time between seeing a change on the DCP feed and when it’s available in the channel cache.

DCP cache latency = dcp_caching_time / dcp_caching_count

max_pending

High watermark for number of documents buffered during feed processing, waiting on a missing earlier sequence.

num_pull_repl_active_continuous

Gauge representing the number of continuous pull replications in the active state.

num_pull_repl_active_one_shot

Gauge representing the number of one-shot pull replications in the active state.

num_pull_repl_caught_up

Gauge representing the number of replications which have caught up to the latest changes.

num_pull_repl_since_zero

Number of new replications starting per second (/_changes?since=0).

num_pull_repl_total_continuous

Gauge representing the number of continuous pull replications.

num_pull_repl_total_one_shot

Gauge representing the number of one-shot pull replications.

request_changes_count

This metric can be used to calculate the latency of _changes request.

_changes request latency = request_changes_time / request_changes_count

request_changes_time

This metric can be used to calculate the latency of _changes request.

_changes request latency = request_changes_time / request_changes_count

rev_processing_time

The total amount of time processing revisions.

This metric can be used with rev_send_count to calculate the average processing time per revision.

average processing time per revision = rev_processing_time / rev_send_count.

rev_send_count

The total amount of time processing revisions.

This metric can be used with rev_send_count to calculate the average processing time per revision.

average processing time per revision = rev_processing_time / rev_send_count.

rev_send_latency

In a pull replication, Sync Gateway sends a /_changes request to the client. The client responds with the list of revisions that it wants to receive.

rev_send_latency is measuring the time between the client asking for some revisions via the /_changes response, and Sync Gateway sending that revision to the client.

Measuring time from the /_changes response means that this stat will vary significantly depending on the changes batch size. A larger batch size will result in a spike of this stat, even if the processing time per revision is unchanged. A more useful stat might be the average processing time per revision (rev_processing_time / rev_send_count).

database

Stats relative to the database

crc32c_match_count

Count of instances during import when the document cas had changed, but the document body was not changed.

dcp_caching_count

Count of DCP mutations added to Sync Gateway’s channel cache. Can be used with dcp_caching_time to monitor cache processing latency.

dcp_caching_time

Time between DCP mutation arriving at Sync Gateway and being added to channel cache (aggregate).

dcp_received_count

Number of document mutations received by Sync Gateway over DCP.

dcp_received_time

Time between document write and document being received by Sync Gateway over DCP. If the document was written prior to Sync Gateway starting the feed, is measured as the time since the feed was started. Can be used to monitor DCP feed processing latency.

doc_reads_bytes_blip

Total number of bytes read via Couchbase Lite 2.x replication since Sync Gateway startup.

doc_writes_bytes

Total number of bytes written as part of document writes since Sync Gateway startup.

doc_writes_bytes_blip

Total number of bytes written as part of Couchbase Lite 2.x document writes since Sync Gateway startup.

num_doc_reads_blip

Count of the number of documents read via Couchbase Lite 2.x replication since Sync Gateway startup.

num_doc_reads_rest

Count of the number of documents read via the REST API since Sync Gateway startup. Includes Couchbase Lite 1.x replication.

num_doc_writes

Count of the number of documents written via any means since Sync Gateway startup.

num_replications_active

Approximate number of active replications. Only counts continuous pull replications.

num_replications_total

Count of the number of replications created since Sync Gateway startup.

sequence_get_count

Number of high sequence lookups.

sequence_released_count

Number of unused, reserved sequences released by Sync Gateway.

sequence_reserved_count

Number of sequences reserved by Sync Gateway.

Security

Stats relative to security

auth_failed_count

Number of unsuccessful authentications. Useful to monitor the number of authentication errors.

auth_success_count

Number of successful authentications. Useful to monitor the number of authenticated requests.

num_access_errors

Count of documents rejected by write access functions (requireAccess/requireRole/requireUser).

num_docs_rejected

Count of documents rejected by the sync function. Useful to debug sync function issues and identify unexpected incoming documents.

total_auth_time

Total time it took to authenticate the last incoming request.

per_replication

Stats for each replication between Sync Gateway instances declared in the config file.

$replname

Stats relative to a replication between two Sync Gateway instances declared in the config file.

sgr_active

Whether the replication is active at this time.

This can be useful when analyzing replication history, and to filter by active replications.

In a one-shot replication, the value is true for the duration of the replication, and then false when it has completed or if it is cancelled.

In a continuous replication, the value is true for the duration of the replication, and also once it has caught up (i.e is in the idle state). The value is false if the replication is explicitly cancelled.

sgr_docs_checked_sent

Number of documents checked for changes since start-up (via /{db}/_revs_diff).

sgr_num_attachments_transferred

Number of attachments transferred since start-up.

sgr_num_attachment_bytes_transferred

Number of attachment bytes transferred since start-up.

sgr_num_docs_failed_to_push

Number of documents that failed to be pushed since start-up.

sgr_num_docs_pushed

Number of documents that were pushed since start-up.