Kubernetes Automated Upgrade

During your normal course of operations, you’ll no doubt need to upgrade the Kubernetes cluster on which your Couchbase deployment resides. This page details the requirements and considerations that you’ll need to take into account when performing an online upgrade of a Kubernetes cluster that’s hosting a stateful application like Couchbase Server.

Overview

The general procedure for upgrading a Kubernetes cluster starts with upgrading the Kubernetes master components (e.g. the API server, controllers, and database). This part of the upgrade should not affect the operation of existing Operator deployments or Couchbase cluster resources.

Once the upgrades to the master components are complete, the next step is to manually drain the Kubernetes nodes of their pods, one node at a time. After a node is completely drained, you can upgrade the kubelet component and make any other necessary host-level upgrades (e.g. security patching and kernel updates). Once the node has completed all of its upgrades, it can be uncordoned so that Kubernetes can schedule and run pods on it again.
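The per-node cycle described above can be sketched with standard kubectl commands. This is a minimal illustration, not a complete upgrade procedure: the helper names drain_node and restore_node and the node name worker-1 are hypothetical, and the kubelet or host upgrade itself depends on your installation method.

```shell
#!/usr/bin/env bash
# Sketch: take one node out of service, then return it to service.
# drain_node and restore_node are illustrative helper names.

drain_node() {
  local node="$1"
  # kubectl drain cordons the node, then evicts its pods.
  # DaemonSet-managed pods are skipped; emptyDir data on evicted pods is lost.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

restore_node() {
  local node="$1"
  # Once the kubelet and host upgrades are finished, allow scheduling again.
  kubectl uncordon "$node"
}

# Usage (the upgrade step in between is provider-specific):
#   drain_node worker-1
#   ... upgrade the kubelet, apply patches ...
#   restore_node worker-1
```

The drain respects pod disruption budgets, so it may block until the evictions it requests are permitted.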

When upgrading Kubernetes nodes that are running Couchbase pods, there are certain requirements that you need to adhere to before and after each node has been drained. These requirements, detailed in the following sections, must be met for each node before you can move on to another node.

The exact upgrade process is dependent on your Kubernetes installation method or cloud provider. Consult the relevant third-party documentation for full instructions.

Operator and Couchbase Cluster Considerations

Kubernetes provides some resources that help minimize disruption when pods are evicted from Kubernetes nodes during an upgrade. Pod disruption budgets, for example, limit the number of pods matching a rule that are allowed to be down at the same time. Pods cannot be evicted from a Kubernetes node if the eviction would exceed this limit. Only once the number of ready pods is back within the specified tolerance can evictions start again.

For a stateful application like Couchbase Server, a pod being ready (responding to API requests on the admin port) after it has been evicted and recreated by the Operator does not mean that it is safe to evict another pod. Instead, you must ensure that all data is safely rebalanced and replicated across the Couchbase cluster before you evict another pod. Failure to do so may result in data loss.

Depending on your specific deployment configuration, multiple Couchbase Server pods may be resident on the node that is being drained. If the pods belong to the same cluster, they may need to be manually failed over via the Couchbase Web Console in order for the Operator to recover and rebalance.

The Operator itself is a stateless application typically managed by a Deployment. When an Operator pod is evicted, the Deployment controller will automatically create a replacement pod that will begin managing the CouchbaseCluster resources in its namespace.

Preparation for Upgrade

Before beginning an upgrade to your Kubernetes cluster, review the following considerations and prerequisites:

Because an eviction deletes a pod, ensure that the Couchbase cluster is scaled correctly so that it can handle the increased load of having a pod down while a new pod is balanced into the cluster.
To minimize disruption, ensure that a short failover period is configured with the couchbaseclusters.spec.cluster.autoFailoverTimeout parameter to reduce downtime before another node takes over the load.
Ensure that there is capacity in your Kubernetes cluster to handle the scheduling of replacement Couchbase pods. For example, if a Couchbase cluster were running on Kubernetes nodes marked exclusively for use by Couchbase, and anti-affinity were enabled as per the deployment best practices, the Kubernetes cluster would require at least one other node capable of scheduling and running your Couchbase workload. For clusters deployed with server group support, this would require another node per availability zone.
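As an illustration of the failover setting mentioned above, the following sketch patches autoFailoverTimeout on an existing CouchbaseCluster resource. The helper name, the cluster name cb-example, the namespace, and the value format (a duration string such as 30s) are all assumptions; check the CouchbaseCluster resource reference for your Operator version before applying anything similar.

```shell
#!/usr/bin/env bash
# Sketch: shorten the auto-failover timeout on a CouchbaseCluster resource.
# set_auto_failover_timeout is an illustrative helper name; the duration
# format of the value is an assumption and may differ between Operator versions.

set_auto_failover_timeout() {
  local cluster="$1" namespace="$2" timeout="$3"
  local patch
  # Build a JSON merge patch that touches only spec.cluster.autoFailoverTimeout.
  patch="$(printf '{"spec":{"cluster":{"autoFailoverTimeout":"%s"}}}' "$timeout")"
  kubectl patch couchbasecluster "$cluster" --namespace "$namespace" \
    --type merge --patch "$patch"
}

# Usage:
#   set_auto_failover_timeout cb-example default 30s
```

A merge patch is used here so that the rest of the resource specification is left untouched.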

The Kubernetes cluster should not be upgraded while any Couchbase clusters are in hibernation. Instructions to wake the cluster can be found in the hibernation documentation.

Performing the Upgrade

There are two supported methods for upgrading a Kubernetes cluster: automatic and manual.

Automatic Upgrade

To prevent downtime or a data loss scenario, the Operator provides controls for how automated Kubernetes upgrades proceed.

The Operator creates a PodDisruptionBudget for each CouchbaseCluster resource. The PodDisruptionBudget specifies that at least N-1 pods, where N is the cluster size, be ready at any time. This constraint allows at most one pod to be evicted at a time. To support an automatic Kubernetes upgrade, it’s therefore recommended that the cluster be deployed with anti-affinity enabled, so that each Kubernetes node hosts at most one pod of the cluster and draining a node requires only a single eviction.
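To see how this budget behaves in practice, you can read the number of evictions that a PodDisruptionBudget currently permits from its status. The sketch below is illustrative only: the helper names are hypothetical, and the name of the PodDisruptionBudget that the Operator creates must be looked up first (for example with kubectl get poddisruptionbudget), since it is not stated here. For a three-pod cluster, N-1 means two pods must be ready, so the value is 1 when all three pods are ready and 0 while one is down.

```shell
#!/usr/bin/env bash
# Sketch: check whether a PodDisruptionBudget currently permits an eviction.
# disruptions_allowed and safe_to_evict are illustrative helper names.

disruptions_allowed() {
  local pdb="$1" namespace="$2"
  # status.disruptionsAllowed is the number of pods that may be evicted now.
  kubectl get poddisruptionbudget "$pdb" --namespace "$namespace" \
    --output jsonpath='{.status.disruptionsAllowed}'
}

safe_to_evict() {
  local allowed
  allowed="$(disruptions_allowed "$1" "$2")" || return 1
  # Treat an empty value as zero allowed disruptions.
  [ "${allowed:-0}" -ge 1 ]
}

# Usage (the budget name is a placeholder):
#   safe_to_evict <pdb-name> default && echo "an eviction is currently permitted"
```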

To ensure that Couchbase Server pods report ready only after they have been added to the cluster and balanced in, the Kubernetes Pods are created with a readiness gate. The readiness gate is only set once all of the Couchbase Server nodes are balanced into the cluster and the cluster is in a stable condition that can tolerate the loss of a Pod without losing data.
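The readiness gate can be observed with ordinary kubectl queries. The following sketch lists the readiness gate condition types declared on a pod and waits for the pod to become ready; the helper names, pod name, and timeout are hypothetical, and the condition type that the Operator uses is deliberately not assumed here.

```shell
#!/usr/bin/env bash
# Sketch: inspect a pod's readiness gates and wait for it to become ready.
# show_readiness_gates and wait_until_ready are illustrative helper names.

show_readiness_gates() {
  local pod="$1" namespace="$2"
  # Prints the condition types listed under spec.readinessGates, if any.
  kubectl get pod "$pod" --namespace "$namespace" \
    --output jsonpath='{.spec.readinessGates[*].conditionType}'
}

wait_until_ready() {
  local pod="$1" namespace="$2" timeout="${3:-600s}"
  # A pod with readiness gates is Ready only when every gate condition is True.
  kubectl wait --for=condition=Ready "pod/$pod" \
    --namespace "$namespace" --timeout="$timeout"
}

# Usage (pod name is a placeholder):
#   show_readiness_gates cb-example-0000 default
#   wait_until_ready cb-example-0000 default 900s
```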

Manual Upgrade

When you drain a Kubernetes node containing a Couchbase pod, the following sequence of events will occur:

The Operator will detect that a cluster member is down. You should monitor this via the Operator logs or Couchbase Web Console.
The Couchbase pod will be failed over and the Operator will create a replacement pod in order to restore server class sizing.
The Operator will initiate a rebalance to redistribute data across the cluster and restore redundancy constraints.

Only when the rebalance is successfully complete can you safely drain and upgrade another Kubernetes node containing a Couchbase pod belonging to the same Couchbase cluster.
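One way to script the wait between node drains is to poll the CouchbaseCluster resource until it reports a balanced state. The sketch below assumes that the resource exposes a status condition of type Balanced; that condition name, the helper names, and the cluster name are assumptions, so verify them against your Operator version (for example with kubectl describe couchbasecluster) before relying on this.

```shell
#!/usr/bin/env bash
# Sketch: poll a CouchbaseCluster until it reports a balanced state.
# Assumption: the resource has a status condition of type "Balanced".
# cluster_balanced and wait_for_balanced are illustrative helper names.

cluster_balanced() {
  local cluster="$1" namespace="$2" status
  status="$(kubectl get couchbasecluster "$cluster" --namespace "$namespace" \
    --output jsonpath='{.status.conditions[?(@.type=="Balanced")].status}')" || return 1
  [ "$status" = "True" ]
}

wait_for_balanced() {
  local cluster="$1" namespace="$2" attempts="${3:-60}" interval="${4:-10}"
  # Check up to $attempts times, sleeping $interval seconds after each miss.
  while [ "$attempts" -gt 0 ]; do
    cluster_balanced "$cluster" "$namespace" && return 0
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}

# Usage: wait up to about 30 minutes before draining the next node.
#   wait_for_balanced cb-example default 180 10 || echo "cluster not balanced yet"
```

Immediately after a drain, the cluster may still report a balanced state until the Operator detects the missing member, so confirm in the Operator logs or the Couchbase Web Console that the failover and rebalance have actually taken place.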

Graceful Pod Rescheduling During Node Maintenance

When performing maintenance operations such as node drains or pod evictions in Kubernetes, Couchbase Server pods may be terminated abruptly without graceful shutdown. This can cause transient write failures as the cluster SDK continues routing operations to the terminated node until auto-failover completes and the cluster map updates.

To prevent application-level interruptions during planned maintenance, the Couchbase Autonomous Operator provides mechanisms for graceful pod rescheduling.

Manual Pod Rescheduling

If you know in advance that a worker node will be drained, you can manually trigger graceful pod rescheduling by cordoning the node and then applying the cao.couchbase.com/reschedule annotation to the Couchbase Server pods running on it.

When this annotation is applied to a Couchbase Server pod, the Operator:

Gracefully removes the pod from the cluster (via failover or rebalance)
Waits for the operation to complete
Deletes and reschedules the pod on a suitable node

To trigger manual rescheduling:

kubectl annotate pod <pod-name> cao.couchbase.com/reschedule=true

For more details, see Pod Rescheduling.
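The cordon and annotate steps can be combined into a single helper. This is a minimal sketch: the function name is hypothetical, and the default label selector app=couchbase is an assumption about how the Couchbase Server pods are labeled, so adjust it to match your deployment.

```shell
#!/usr/bin/env bash
# Sketch: cordon a node, then request graceful rescheduling of the
# Couchbase Server pods running on it.
# reschedule_couchbase_pods is an illustrative helper name, and the default
# selector app=couchbase is an assumption about the pod labels.

reschedule_couchbase_pods() {
  local node="$1" namespace="$2" selector="${3:-app=couchbase}"
  # Stop new pods, including the replacements, from landing on this node.
  kubectl cordon "$node" || return 1
  # List matching pods on the node as pod/<name> and annotate each one.
  kubectl get pods --namespace "$namespace" --selector "$selector" \
    --field-selector "spec.nodeName=$node" --output name |
  while read -r pod; do
    kubectl annotate "$pod" --namespace "$namespace" \
      cao.couchbase.com/reschedule=true
  done
}

# Usage:
#   reschedule_couchbase_pods worker-1 default
```

With anti-affinity enabled, the selector matches at most one pod per Couchbase cluster on a given node. Without it, consider annotating one pod at a time and waiting for the cluster to settle in between.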

Automated Pod Rescheduling with Eviction Webhook

For automated handling of node drains and pod evictions, Couchbase provides an open-source Eviction Reschedule Hook. This webhook monitors eviction events and automatically applies the cao.couchbase.com/reschedule annotation, allowing the Operator to handle graceful rescheduling without manual intervention. Note that the webhook is intended to protect pods that are evicted from nodes drained with the kubectl drain command. It is not intended to be used with manual eviction requests.

The webhook provides:

Automatic detection of pod eviction events
Graceful pod removal before Kubernetes terminates the pod
Zero-downtime node drains for planned maintenance

For detailed setup instructions and architecture, see the Rethinking Node Drains: A Webhook Based Approach to Graceful Pod Removal blog post.

The automated webhook approach is recommended for production environments where node maintenance is frequent or unpredictable.