Integrate Prometheus

    Description: Integrating Sync Gateway and Prometheus for Stats Monitoring and Alerts
    Abstract: This content explains how to integrate Sync Gateway and Prometheus to provide effective monitoring and alerts for Sync Gateway events.
    Related Content: Metrics REST API | Monitor

    Introduction

    Sync Gateway’s Metrics REST API exposes stats in both Prometheus-compatible and JSON formats.

    Configuration

    The Metrics REST API is enabled by default on port 4986. To change this, edit your sync-gateway-config.json configuration file.

    Locate the metricsInterface setting and set it to the Sync Gateway address and port you want the API served on (for example, 127.0.0.1:4986); see Example 1.

    Example 1. Setting metrics interface endpoint
    "metricsInterface": "127.0.0.1:4986" (1)
    1 Here we define the Sync Gateway URL and the port (4986 in this instance) that we require the Metrics REST API to be served on.
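To confirm that the endpoint is serving data, you can read it directly. The following is a minimal sketch rather than anything shipped with Sync Gateway: it assumes the interface from Example 1 (127.0.0.1:4986) and the /_metrics path used in the Prometheus configuration on this page. The helper names parse_metrics and fetch_metrics and the sample payload are illustrative assumptions.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text-format samples into a {series: value} dict."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        close = line.rfind("}")
        if close != -1:
            # Series with labels: name{label="value"} value [timestamp]
            series, rest = line[: close + 1], line[close + 1:].split()
        else:
            # Series without labels: name value [timestamp]
            series, *rest = line.split()
        samples[series] = float(rest[0])
    return samples


def fetch_metrics(url="http://127.0.0.1:4986/_metrics", timeout=5):
    """Fetch and parse the metrics feed (URL is an assumption, see above)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative payload only; real output contains many more series.
SAMPLE_PAYLOAD = """\
# TYPE sgw_resource_utilization_error_count counter
sgw_resource_utilization_error_count 3
sgw_security_auth_failed_count{database="db1"} 12
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE_PAYLOAD))
```

Against a running Sync Gateway, calling fetch_metrics() with your own metricsInterface address should return the same kind of dictionary, keyed by series name and labels.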

    Integration

    You need to integrate Sync Gateway’s metrics feed with your Prometheus deployment. Couchbase provides both a configuration file and a sample rules file to make this integration easier.

    Copy both the Prometheus configuration file (prometheus.yml) and the baseline rules file (rules/sync-gateway.rules.yml) from Sync Gateway’s release package into Prometheus’s configuration directory (/etc/prometheus), as shown in Example 2.

    Example 2. Files in-situ
    /etc/prometheus/prometheus.yml (1)
    /etc/prometheus/rules/sync-gateway.rules.yml (2)
    1 You can change this location by specifying the path with the command-line flag --config.file when starting Prometheus.
    2 You can specify a different location for the rules file by editing the path in the rule_files section of the prometheus.yml configuration file.

    Configuration

    Configuration of Prometheus to work with Sync Gateway is governed by two files, starter copies of which are provided with Sync Gateway — see: Example 3 for sample file contents.

    Prometheus Configuration File

    The provided prometheus.yml file specifies the configuration required to scrape the Sync Gateway metrics target. In this instance it defines Sync Gateway’s metricsInterface as being accessible on sync_gateway:4986/_metrics. If you have multiple Sync Gateways, you can specify all their endpoints here (as targets).

    Prometheus Rules File

    Prometheus’s rules files enable you to specify both recording and alerting rules. Sync Gateway’s out-of-the-box rule set provides a starting point, which you can customize as needed. The rules include:

    • A recording rule that sums all query counts and saves the total as sgw::gsi::total_queries

    • A few example alerts
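To make the recording rule more concrete, here is a rough Python model of what its expression, sum by (instance, database, job) over every series whose name matches sgw_gsi_views_.*_count, computes. It is a sketch of the aggregation only, not how Prometheus evaluates rules internally. The function name total_queries and the sample series and values are illustrative assumptions.

```python
import re
from collections import defaultdict

# PromQL regex matchers are fully anchored, hence fullmatch() below.
PATTERN = re.compile(r"sgw_gsi_views_.*_count")
GROUP_BY = ("instance", "database", "job")


def total_queries(samples):
    """Sum matching series, grouped by (instance, database, job)."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if PATTERN.fullmatch(name):
            # A missing label is treated as an empty value when grouping.
            key = tuple(labels.get(label, "") for label in GROUP_BY)
            totals[key] += value
    return dict(totals)


# Illustrative samples: (metric name, labels, current value).
DB1 = {"instance": "sync_gateway:4986", "database": "db1", "job": "swg"}
DB2 = {"instance": "sync_gateway:4986", "database": "db2", "job": "swg"}
SAMPLES = [
    ("sgw_gsi_views_access_count", DB1, 10),
    ("sgw_gsi_views_channels_count", DB1, 5),
    ("sgw_gsi_views_access_count", DB2, 7),
    ("sgw_security_auth_failed_count", DB1, 99),  # name does not match
]

if __name__ == "__main__":
    for key, total in total_queries(SAMPLES).items():
        print(key, total)
```

Each resulting group corresponds to one sgw::gsi::total_queries series, so there is one total per database on each Sync Gateway instance.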

    Example 3. Sample file contents
    • Config: prometheus.yml

    • Rules: sync-gateway.rules.yml

    The config file (prometheus.yml) specifies the configuration that the Prometheus server is launched with.

    global:
      scrape_interval:     5s (1)
      evaluation_interval: 5s
    
    rule_files: (2)
      - '/etc/prometheus/rules/*'
    
    scrape_configs:
      - job_name: swg
        metrics_path: /_metrics
        static_configs:
          - targets: (3)
              - sync_gateway:4986
    1 The scrape_interval specifies the polling interval.
    This interval determines the frequency at which Prometheus will scrape data from this endpoint. You can adjust it to your needs.
    2 rule_files specifies the path to the Prometheus rules file(s).
    The rules file defines any custom alerts based on the collected stats.
    3 The targets property specifies the list of targets making statistics available to Prometheus; here we specify Sync Gateway’s metricsInterface.
    If you have multiple Sync Gateways, you can specify each of their endpoints here.
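If you run several Sync Gateway nodes, generating the targets list can be easier than editing it by hand. The sketch below renders a configuration with the same layout as the sample above, with one target per node. The function name render_prometheus_config and the host names sgw-1 and sgw-2 are hypothetical; substitute the metricsInterface address of each of your nodes.

```python
def render_prometheus_config(targets, job_name="swg", interval="5s"):
    """Render a prometheus.yml body listing one scrape target per node."""
    lines = [
        "global:",
        f"  scrape_interval:     {interval}",
        f"  evaluation_interval: {interval}",
        "",
        "rule_files:",
        "  - '/etc/prometheus/rules/*'",
        "",
        "scrape_configs:",
        f"  - job_name: {job_name}",
        "    metrics_path: /_metrics",
        "    static_configs:",
        "      - targets:",
    ]
    # One list entry per Sync Gateway metrics endpoint (host:port).
    lines += [f"          - {target}" for target in targets]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    print(render_prometheus_config(["sgw-1:4986", "sgw-2:4986"]))
```

Because every node is scraped under the same job, the instance label is what tells the series from different Sync Gateways apart in queries and alerts.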

    The rules file (sync-gateway.rules.yml) specifies the alerting and recording rules.

    groups:
      - name: sync-gateway.rules
        rules:
          - record: sgw::gsi::total_queries (1)
            expr: sum by (instance, database, job) ({__name__=~"sgw_gsi_views_.*_count"})
          - alert: TooManyAuthFailuresInLastHour
            expr: increase(sgw_security_auth_failed_count[1h]) > 1000
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: Too Many Auth Failures in Last Hour
          - alert: TooManyDocumentAccessFailuresInLastHour (2)
            expr: increase(sgw_security_num_access_errors[1h]) > 1000
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: Too many Document Access Failures in last hour
          - alert: TooManyDocumentRejectionFailuresInLastHour
            expr: increase(sgw_security_num_docs_rejected[1h]) > 1000
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: Too many Document Rejection Failures in last hour
          - alert: HighRevCacheMissRate
            expr: sgw_cache_rev_cache_misses / (sgw_cache_rev_cache_misses + sgw_cache_rev_cache_hits) >= 0.8
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: High Rev Cache Miss Rate
          - alert: HighChannelCacheMissRate
            expr: sgw_cache_chan_cache_misses / (sgw_cache_chan_cache_misses + sgw_cache_chan_cache_hits) >= 0.8
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: High Channel Cache Miss Rate
          - alert: HighDeltaCacheMissRate
            expr: sgw_delta_sync_delta_sync_miss / (sgw_delta_sync_delta_sync_miss + sgw_delta_sync_delta_cache_hit) >= 0.8
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: High Delta Cache Miss Rate
          - alert: GlobalErrorCount
            expr: increase(sgw_resource_utilization_error_count[1h]) > 1
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: An error occurred in the last hour
          - alert: WarnXattrSizeCount
            expr: increase(sgw_database_warn_xattr_size_count[1h]) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: A document had larger sync data than the maximum allowed by xattrs in the last hour
          - alert: SGRNumDocsFailedToPull
            expr: increase(sgw_replication_sgr_num_docs_failed_to_pull[1h]) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: At least one document failed to be pulled with Inter Sync Gateway Replication in the last hour
          - alert: SGRNumDocsFailedToPush
            expr: increase(sgw_replication_sgr_num_docs_failed_to_push[1h]) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: At least one document failed to be pushed with Inter Sync Gateway Replication in the last hour
    1 Here we define a recording rule.
    Recording rules allow you to compute and save the results of frequently used (or computationally expensive) expressions.
    2 Here we define an alerting rule.
    Alerting rules allow you to define alert conditions based on an expression and to send notifications when the expression is satisfied.
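The interaction between an alert expression and its for clause can be easier to see in a small simulation. The sketch below is a simplified model, not Prometheus's actual evaluator: it applies the HighRevCacheMissRate condition, misses / (misses + hits) >= 0.8, at each evaluation and reports the alert as firing only once the condition has held for 1m. It assumes the 5s evaluation_interval from the sample configuration. The function names and sample values are illustrative.

```python
import math


def miss_rate(misses, hits):
    """Return misses / (misses + hits), or NaN when there is no traffic."""
    total = misses + hits
    # PromQL yields NaN for 0/0, and NaN never satisfies ">= 0.8".
    return misses / total if total else math.nan


def alert_states(samples, threshold=0.8, hold_seconds=60, step_seconds=5):
    """Return the alert state at each evaluation of (misses, hits) samples."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, (misses, hits) in enumerate(samples):
        now = i * step_seconds
        if miss_rate(misses, hits) >= threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            # Any evaluation where the condition fails resets the timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # No traffic, then a sustained 90% miss rate, then recovery.
    samples = [(0, 0)] + [(9, 1)] * 13 + [(1, 9)]
    for step, state in enumerate(alert_states(samples)):
        print(step * 5, "s:", state)
```

In this model a brief spike in the miss rate only leaves the alert pending. The condition has to stay true across every evaluation for the full for duration before the alert fires.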