Setting up Disaster Recovery
- Sync Gateway 4.0
How to set up a Sync Gateway mobile cluster for Disaster Recovery (DR) using Couchbase Server’s Cross Data Center Replication (XDCR)
Introduction
Couchbase Server Cross Data Center Replication (XDCR) replicates data between two or more autonomous Couchbase Server clusters. It plays an important role in supporting Disaster Recovery (DR) and Data Migration, even where Sync Gateway is the normal replicator of choice for mobile data.
Recommended Deployment Models
Zero Downtime Active-Active Disaster Recovery
This model provides zero-downtime disaster recovery using bi-directional XDCR between two active mobile clusters.
This requires running Sync Gateway 4.0+ on both sides of the active-active XDCR setup.
Both clusters remain operational, with seamless fail-over through load balancer switching.
You must configure both clusters with import_docs=true.
To set up zero-downtime disaster recovery:
- Configure bi-directional XDCR between the Primary and Disaster Recovery clusters. Enable automatic filtering of cluster-specific metadata.
- Deploy Sync Gateway in active mode on both clusters.
- Configure users, roles, and databases independently on both clusters. XDCR replicates documents and attachments, but you must set up users, roles, and databases separately on each cluster.
- Configure your load balancer to route traffic primarily to the Primary cluster.
- Verify replication health between the two active clusters.
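As a rough illustration of the first step, the sketch below builds the two replication requests that a bi-directional setup needs, one per direction. It assumes the Couchbase Server REST endpoint /controller/createReplication and its fromBucket, toCluster, toBucket, and replicationType form fields. The host names, bucket name, and remote-cluster reference names are hypothetical, the remote-cluster references are assumed to exist already, and the metadata-filtering setting is not covered. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def replication_form(from_bucket, to_cluster_ref, to_bucket):
    """Form fields for POST /controller/createReplication on the source cluster."""
    return {
        "fromBucket": from_bucket,
        # Name of a remote-cluster reference already defined on the source cluster
        # (hypothetical names are used in this sketch).
        "toCluster": to_cluster_ref,
        "toBucket": to_bucket,
        "replicationType": "continuous",
    }


def bidirectional_plan(primary_host, dr_host, bucket,
                       primary_ref="primary-ref", dr_ref="dr-ref"):
    """Return one (url, encoded_body) pair per replication direction."""
    path = "/controller/createReplication"
    return [
        # Primary -> Disaster Recovery
        (f"{primary_host}{path}", urlencode(replication_form(bucket, dr_ref, bucket))),
        # Disaster Recovery -> Primary
        (f"{dr_host}{path}", urlencode(replication_form(bucket, primary_ref, bucket))),
    ]


if __name__ == "__main__":
    plan = bidirectional_plan("http://primary.example:8091",
                              "http://dr.example:8091", "mobile")
    for url, body in plan:
        print("POST", url, body)
```

Each pair would be sent as an authenticated POST to the cluster named in its URL. Building the requests separately from sending them makes the plan easy to review before anything changes.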
To activate disaster recovery:
- Update load balancer configuration to redirect traffic to the disaster recovery cluster. This process requires no Sync Gateway service interruption.
- Verify disaster recovery cluster is handling traffic properly.
- Maintain bi-directional replication for recovery preparedness. The original primary becomes the new DR cluster automatically and requires no manual XDCR reconfiguration.
Clusters in Same Region
This model caters for situations where the Active and Disaster Recovery clusters are in the same region or data center (see Figure 2). It includes an optional optimization step, which helps avoid downtime during the activation stage.
To set up and maintain a disaster recovery cluster:
- [Optional step, for optimization] Start Sync Gateway with offline: true in the Disaster Recovery cluster to asynchronously create indexes. Creating all indexes beforehand reduces switching costs. If you skip this step, you’ll incur latency at switchover while Sync Gateway builds its indexes on the Disaster Recovery cluster.
- Connect Sync Gateway to your Primary cluster.
- Start the unidirectional XDCR from the Primary cluster to the Disaster Recovery cluster.
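For the optional index warm-up step, a database configuration along the following lines could be generated. This is a minimal sketch that assumes the database configuration is supplied to Sync Gateway as JSON. The offline property comes from the step above, and the bucket name "mobile" is hypothetical. A real configuration needs further properties that are omitted here.

```python
import json


def dr_database_config(bucket, warm_up_only):
    """Build a minimal database config for the Disaster Recovery side.

    warm_up_only=True sets offline: true, so the database starts offline
    (for the index warm-up described above) without serving client traffic.
    """
    config = {
        "bucket": bucket,         # hypothetical bucket name supplied by the caller
        "offline": warm_up_only,  # property named in the optional step above
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(dr_database_config("mobile", warm_up_only=True))
```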
When you’re ready to switch to Disaster Recovery operations:
- Stop the replication (XDCR) from the Primary cluster to the Disaster Recovery cluster.
- After you stop XDCR, switch the Load Balancer to point to the Sync Gateway on the Disaster Recovery cluster. This approach keeps the deployment of Sync Gateway at only one end of the XDCR replication.
- Promote the Disaster Recovery cluster to Primary and the old Primary to Disaster Recovery.
- Flush all replicated buckets in the old Primary cluster as a precaution against any spurious writes that reached it but that XDCR had not replicated when you stopped it.
- Reverse the XDCR to replicate from the newly promoted Primary to the old Primary to set up a new Backup.
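The order of these steps matters, so it can help to write the switchover down as an explicit plan before running it. The sketch below does that, assuming the Couchbase Server REST endpoints /controller/cancelXDCR/<id>, /pools/default/buckets/<bucket>/controller/doFlush, and /controller/createReplication. Host names, the replication ID, and the remote-cluster reference name are hypothetical. The load balancer switch is shown as a manual placeholder because it depends on your deployment. Nothing is sent over the network.

```python
from urllib.parse import quote, urlencode


def same_region_switchover(old_primary, new_primary, buckets, replication_ids,
                           old_primary_ref="old-primary-ref"):
    """Return the switchover as an ordered list of (method, target, body) steps."""
    steps = []

    # 1. Stop XDCR from the Primary to the Disaster Recovery cluster.
    #    Replication IDs contain "/" and must be URL-encoded in the path.
    for rep_id in replication_ids:
        url = f"{old_primary}/controller/cancelXDCR/{quote(rep_id, safe='')}"
        steps.append(("DELETE", url, None))

    # 2. Switch the load balancer (deployment-specific, so left as a manual step).
    steps.append(("MANUAL",
                  "point the load balancer at Sync Gateway on the Disaster Recovery cluster",
                  None))

    # 3. Promotion is a change of roles, not a REST call, so it adds no step here.

    # 4. Flush the replicated buckets on the old Primary.
    #    Flush must be enabled on a bucket for this request to succeed.
    for bucket in buckets:
        url = f"{old_primary}/pools/default/buckets/{quote(bucket, safe='')}/controller/doFlush"
        steps.append(("POST", url, None))

    # 5. Reverse XDCR: new Primary -> old Primary.
    for bucket in buckets:
        body = urlencode({
            "fromBucket": bucket,
            "toCluster": old_primary_ref,  # hypothetical remote-cluster reference name
            "toBucket": bucket,
            "replicationType": "continuous",
        })
        steps.append(("POST", f"{new_primary}/controller/createReplication", body))

    return steps


if __name__ == "__main__":
    for step in same_region_switchover("http://old.example:8091", "http://new.example:8091",
                                       ["mobile"], ["abc123/mobile/mobile"]):
        print(step)
```

Keeping the plan as data lets you review or log each step before executing it, and makes it harder to flush or reverse replication before XDCR has actually been stopped.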
Clusters in Different Regions or Data Centers
This model caters for situations where the Active and Disaster Recovery clusters are in different regions or data centers. Although the model has a separate Sync Gateway cluster attached to the Disaster Recovery cluster, this approach still keeps Sync Gateway active at only one end of the XDCR replication. The optional optimization step helps avoid downtime during the activation stage.
To set up and maintain a disaster recovery cluster (see Figure 4):
- [Optional step, for optimization] Start Sync Gateway with offline: true in the Disaster Recovery cluster to asynchronously create indexes. If you skip this step, you’ll incur latency at switchover while Sync Gateway builds its indexes on the Disaster Recovery cluster.
- [Critical step] Turn off all the Sync Gateways in the Disaster Recovery cluster.
- Start the unidirectional XDCR from the Primary cluster to the Disaster Recovery cluster.
When you’re ready to switch to Disaster Recovery operations (see Figure 5):
- Stop Sync Gateway on the Primary cluster.
- Stop the replication (XDCR) from the Primary cluster to the Disaster Recovery cluster.
- Verify that all Load Balancer updates needed to direct traffic to the new Sync Gateway cluster are in place.
- Turn on the Sync Gateway cluster in the Disaster Recovery cluster.
- Assign the Disaster Recovery cluster to be the new Primary cluster, and make the old Primary cluster the new Disaster Recovery cluster.
- Flush all replicated buckets in the old Primary cluster as a precaution against any spurious writes that reached it but that XDCR had not replicated when you stopped it.
- Reverse the original XDCR to replicate from the newly promoted Primary to the old Primary, to set up a new Backup.
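The setup and switchover sequences for this model share one constraint: Sync Gateway must not be running on the Disaster Recovery cluster while XDCR is replicating into it. The sketch below is a small simulation of that constraint, not a deployment tool. The step names and the ClusterState fields are hypothetical labels for the steps above, and the model tracks only whether each component is running.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    sgw_on_primary: bool = True
    sgw_on_dr: bool = False
    xdcr_primary_to_dr: bool = False


# Each step sets one field of the state (hypothetical step names).
STEP_EFFECTS = {
    "start_sgw_dr_offline_for_indexing": ("sgw_on_dr", True),
    "stop_sgw_dr": ("sgw_on_dr", False),
    "start_xdcr": ("xdcr_primary_to_dr", True),
    "stop_sgw_primary": ("sgw_on_primary", False),
    "stop_xdcr": ("xdcr_primary_to_dr", False),
    "start_sgw_dr": ("sgw_on_dr", True),
}


def run_steps(steps):
    """Apply steps in order and fail if Sync Gateway runs on the XDCR target."""
    state = ClusterState()
    for step in steps:
        field, value = STEP_EFFECTS[step]
        setattr(state, field, value)
        if state.xdcr_primary_to_dr and state.sgw_on_dr:
            raise ValueError(f"{step}: Sync Gateway is running on the XDCR target")
    return state


if __name__ == "__main__":
    documented_order = [
        "start_sgw_dr_offline_for_indexing",  # optional warm-up
        "stop_sgw_dr",                        # critical step
        "start_xdcr",
        "stop_sgw_primary",
        "stop_xdcr",
        "start_sgw_dr",
    ]
    print(run_steps(documented_order))
```

Running the documented order completes without error, while skipping the critical "stop_sgw_dr" step before "start_xdcr" raises a ValueError. That is the situation the critical step is there to prevent.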