Operator Self-Certification Lifecycle

      Certifying your platform for use with Couchbase Autonomous Operator.

      Why Self-Certify?

      Couchbase Engineering periodically tests the Operator against a number of major platforms throughout development, for example Amazon EKS, Google GKE, Microsoft AKS, and Red Hat OCP. This representative set aims to cover the majority of platforms in use by end users. Official certification is time-consuming, so the set of officially tested platforms has to be constrained.

      We appreciate that end users may not want to use one of the officially certified platforms for some of the reasons listed below:

      • Data locality — one of the major vendors may not operate in your country or region

      • On-premises deployments — you may be using your own version of Kubernetes

      • Vendor lock-in — you may be tied to existing infrastructure providers

      Take a third-party storage appliance as an example. The cost of procuring the hardware, hosting it, and providing it to a Kubernetes cluster is prohibitive. We need another way of supporting these users and providing confidence that the Operator will function as designed in their environment.

      What is Self-Certification?

      Self-certification began life as an internal test framework used for Operator acceptance and integration testing. It fulfills the following test criteria:

      • Kubernetes API conformance

        • Couchbase custom resources are accepted when they should be

        • Couchbase custom resources are rejected when they should be, either via API schema validation or the admission controller

      • Platform behavior conformance

        • Simulation of platform scheduling, upgrade, error conditions etc.

        • Validation that the Operator recovers database instances in a safe and predictable manner

      • Couchbase Server feature conformance

        • Validation that Couchbase Server behaves in a safe and predictable manner

      Thus the test framework can be used to:

      • Identify where Kubernetes API rules change in an incompatible way

      • Identify where Kubernetes platform behaviors lead to unexpected and unsupportable results

      • Identify where Couchbase Server changes lead to incompatibilities with the Operator

      The logical next step was to package this framework as an Operator self-certification container image that end users can run wherever, and on whatever platform, they wish.

      High Level Overview

      Self-certification is deployed onto the platform under test. The self-certification pod runs a default set of tests that are required for platform certification. Results are stored on a persistent volume. When the self-certification run completes, an ephemeral pod is created, and the results are archived and copied to the local machine.

      Individual test cases are run in separate namespaces so that tests can be run in parallel.

      Security Overview

      Platform certification needs full administrative access to the platform under test. Installation of cluster roles, cluster role bindings and custom resource definitions will always be an administrative task, so this is a hard requirement. Additionally, to test platform behavior, the self-certification image will need to install admission controllers in order to test compatibility, and have access to global resources such as nodes and storage classes in order to correctly simulate error conditions.

      Do not run self-certification on a production cluster. It may perform cleanup operations on any resource it requires, and may interfere with other deployments on the platform. We highly recommend that any platform under test be ephemeral: created on demand, tested, then deprovisioned.

      What is the Self-Certification Lifecycle Process?

      Couchbase Operator Self-Certification Lifecycle is a self-service offering with an easy, step-by-step process to validate the compatibility of Kubernetes platforms and other platform-specific components such as storage and networking with Autonomous Operator.

      The certification workflow consists of the following steps:

      1. Ensure platform requirements for running the self-certification tool are met.

      2. Configure the Kubernetes platform and other platform-specific components such as storage and networking correctly.

      3. Run the Operator Self-Certification Tool (shipped with Couchbase Autonomous Operator 2.3 and above).

      4. Submit the results to Couchbase for approval.

      5. The Couchbase Cloud-Native approval committee will review the results submitted. Based on the review, we will notify you whether the results meet the acceptance criteria or if issues are identified.

      6. Once approval is successful, our Couchbase Partner team will enter into a partnership agreement with the platform vendor or platform-specific component vendor, as applicable.

      All the above steps must be completed for Couchbase to accept and fully support compatibility with a Kubernetes platform and other platform-specific components.

      Running Self-Certification

      In its most basic form, self-certification is started with the following command:

      $ `cao certify`

      This will run the platform certification suite for the Operator version the tool came bundled with. It will run with no parallelism, taking approximately 8 hours to run. For more information see Platform Requirements below.

      Interpreting Results

      When installed and running, the self-certification process will stream output to the console. There are two phases:

      • Preflight checks

      • Acceptance and integration tests

      Preflight checks occur before the tests are started. They perform static checks on the Kubernetes environment to ensure it supports the Operator and Couchbase Server, for example ensuring that the underlying hypervisor supports the required number of processes. Any preflight failures need to be addressed before the tests will run, which typically involves consulting your platform vendor.

      Once the acceptance tests commence, progress is streamed to the console, followed eventually by a summary. If all tests pass then you are certified and ready to go!

      If any tests fail then you will need to submit your results to Couchbase so that Couchbase can either approve use of the Operator or help diagnose any issues that need fixing.

      Submitting Results to Couchbase

      The only thing required for review by Couchbase is the downloaded archive, which will be in the directory where the `cao certify` command was run. The archive will be named something like couchbase-operator-certification-20060102T150405-0700.tar.bz2. Omitting this archive, or sending only a screen capture, will delay the review until the archive is provided.

      To submit the self-certification results to Couchbase, follow these steps:

      1. Capture the output archive in the directory where `cao certify` was run; it will be named something like couchbase-operator-certification-20060102T150405-0700.tar.bz2.

      2. Capture the version information for the Kubernetes platform and for other platform-specific components such as storage and networking.

      3. To upload the results to Couchbase, you will need a JIRA account for Couchbase; you can request a JIRA account here: https://issues.couchbase.com/secure/ContactAdministrators!default.jspa.

      4. Create a new JIRA ticket with the project set to Couchbase Kubernetes (K8S) and the summary set to [Operator Self-Certification Lifecycle].
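
      Before opening the ticket, it can help to confirm that the archive exists and is readable. The following is a minimal Python sketch, not part of the `cao` tool: the function names `find_latest_archive` and `list_archive` are hypothetical, and the glob pattern is an assumption derived from the example file name above. It does not assume anything about the files inside the archive.

```python
import glob
import os
import tarfile

# Assumed pattern, based on the example archive name in this section.
ARCHIVE_PATTERN = "couchbase-operator-certification-*.tar.bz2"


def find_latest_archive(directory="."):
    """Return the most recently modified certification archive, or None."""
    matches = glob.glob(os.path.join(directory, ARCHIVE_PATTERN))
    if not matches:
        return None
    # Modification time is used rather than the name, because the
    # timestamp in the name carries a timezone offset.
    return max(matches, key=os.path.getmtime)


def list_archive(path):
    """Return the member names of a bzip2-compressed tar archive."""
    with tarfile.open(path, "r:bz2") as archive:
        return archive.getnames()


if __name__ == "__main__":
    latest = find_latest_archive()
    if latest is None:
        print("No certification archive found in the current directory")
    else:
        print(latest)
        for name in list_archive(latest):
            print("  " + name)
```

      Run the script from the directory where `cao certify` was run. If it reports no archive, or fails to open the file, resolve that before submitting results.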

      Platform Requirements

      Kubernetes cluster size and node size directly impact how many tests can be run in parallel and, inversely, how long a self-certification run takes:

      • Kubernetes nodes must have at least 4 GiB available memory and 2 vCPUs in order to run self-certification

        • The certification image itself needs approximately 4 GiB of memory and 2 vCPUs to run

        • Couchbase Server instances require at least 2.5 GiB of memory and 2 vCPUs

      • Each test typically uses a 3-pod Couchbase cluster, the minimum supported size.

      • Couchbase Server is very CPU intensive, and the tests are time constrained, so we recommend using only the latest generation of CPU available on the platform

      • You need at least 2 availability zones for server group testing, although 3 are recommended for full test coverage and functionality.

      Given these constraints, the default parallelism of 8, and rounding the Couchbase Server memory requirement up to 3 GiB per instance, we can calculate:

      memory = certification + (parallelism * couchbase_cluster_size * couchbase_memory)
      memory = 4 + (8 * 3 * 3)
      memory = 76 GiB
      CPU = certification + (parallelism * couchbase_cluster_size * couchbase_cpu)
      CPU = 2 + (8 * 3 * 2)
      CPU = 50 vCPU

      Given a typical node of a Kubernetes cluster:

      • 16GiB memory (13GiB available)

      • 4 vCPUs (3.9 vCPUs available)

      We would arrive at:

      nodes = total_memory / node_memory
      nodes = 76 / 13
      nodes = ~6 nodes
      nodes = total_cpu / node_cpu
      nodes = 50 / 3.9
      nodes = ~13 nodes

      The largest of these values is 13, so you would need at least 13 nodes. However, because certification needs at least 2 availability zones, you would use a 14-node cluster with 7 nodes in each zone. For the recommended 3 availability zones, this would entail a 15-node cluster with 5 nodes in each zone.
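
      The sizing arithmetic above can be repeated for other parallelism values or node sizes. The following Python sketch is illustrative only: the function and parameter names are hypothetical, and the defaults are the figures used in this section (3 GiB and 2 vCPUs per Couchbase Server pod, 4 GiB and 2 vCPUs for the certification image, 3 pods per test cluster). Like the estimate above, it divides total resources by per-node resources and does not model how individual pods are packed onto nodes.

```python
import math


def required_resources(parallelism, cluster_size=3, couchbase_memory_gib=3,
                       couchbase_vcpus=2, cert_memory_gib=4, cert_vcpus=2):
    """Return (total memory in GiB, total vCPUs) for a certification run."""
    pods = parallelism * cluster_size
    memory = cert_memory_gib + pods * couchbase_memory_gib
    cpu = cert_vcpus + pods * couchbase_vcpus
    return memory, cpu


def required_nodes(total_memory, total_cpu, node_memory, node_cpu, zones=1):
    """Return the node count, rounded up to a multiple of the zone count."""
    by_memory = math.ceil(total_memory / node_memory)
    by_cpu = math.ceil(total_cpu / node_cpu)
    # The scarcer resource determines the minimum node count.
    nodes = max(by_memory, by_cpu)
    # Spread nodes evenly across availability zones.
    return math.ceil(nodes / zones) * zones


if __name__ == "__main__":
    memory, cpu = required_resources(parallelism=8)
    for zones in (2, 3):
        nodes = required_nodes(memory, cpu, node_memory=13, node_cpu=3.9,
                               zones=zones)
        print(f"{zones} zones: {nodes} nodes ({memory} GiB, {cpu} vCPUs)")
```

      Substitute the available memory and vCPU figures for your own node type. Treat the result as a lower bound, since real schedulers also account for system pods and per-node fragmentation.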