cbbackupmgr cloud

    Storing cbbackupmgr archives directly in the cloud

    DESCRIPTION

    This document should give you a basic understanding of how to manage cbbackupmgr archives directly in S3.

    TUTORIAL

    CREDENTIALS

    Backing up directly to an external cloud provider requires permission to access the given data store. Each provider has its own authentication method; see the sections below for how to authenticate with your chosen cloud provider.

    AWS

    When using AWS S3, there are several ways to supply the credentials that authorize you with AWS S3. Below is a list of the supported techniques:

    1. Providing a set of environment variables including:

      • AWS_REGION

      • AWS_ACCESS_KEY_ID

      • AWS_SECRET_ACCESS_KEY

    2. Loading credentials from the shared config files located at:

      • $HOME/.aws/config

      • $HOME/.aws/credentials

    3. Providing static config/credentials using the cli flags:

      • --obj-access-key-id

      • --obj-region

      • --obj-secret-access-key

    Setting up cbbackupmgr to interact with AWS should be a very similar process to setting up the aws-cli. The steps to configure the aws-cli can be found at https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration.
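
    As a minimal sketch of the first technique, the environment variables can be exported in the shell before running cbbackupmgr. The region and key values shown are placeholders chosen for illustration (assumptions, not real credentials); substitute your own.

```shell
# Placeholder values for illustration only; replace with your own.
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="example-access-key-id"
export AWS_SECRET_ACCESS_KEY="example-secret-access-key"
```

    Commands run from the same shell session inherit these variables, so no credential flags need to be passed on the command line.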

    The staging directory

    One of the most important concepts behind how backup to object store works is the staging directory. The staging directory is a location on disk where temporary data is stored during the execution of a sub-command. For a backup/restore this will be DCP metadata and storage indexes.

    When creating an archive to store in a cloud provider you are required to provide a location for the obj-staging-dir. This is a local location where archive metadata will be stored. During a backup, files will be stored here before they are uploaded to the cloud. Note that cbbackupmgr doesn’t store any document values in the staging directory; they are streamed directly to the cloud.

    Each cloud archive must have a unique staging directory i.e. they can’t be shared. cbbackupmgr will detect cases where the staging directory is being reused across archives.

    Modifying the cloud archive with anything other than cbbackupmgr (for example, the web-ui or cli tools) is not supported whilst using the same staging directory. If a cloud archive has been modified, the staging directory should be removed and recreated before using cbbackupmgr to interact with the archive again.
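
    Removing and recreating the staging directory could be sketched as the small helper below. The function name reset_staging_dir is hypothetical (it is not part of cbbackupmgr), and the sketch assumes no cbbackupmgr operation is currently using the directory.

```shell
# Remove and recreate a staging directory so that stale local
# metadata is discarded before the next cbbackupmgr run.
reset_staging_dir() {
  dir="$1"
  # Refuse to run without a target, so an empty argument can never
  # expand to a destructive path.
  if [ -z "$dir" ]; then
    echo "usage: reset_staging_dir <staging-dir>" >&2
    return 1
  fi
  rm -rf -- "$dir"
  mkdir -p -- "$dir"
}
```

    For the examples in this tutorial it would be called as reset_staging_dir /mnt/staging.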

    The staging directory is only used during operations e.g. backup/restore and can be safely deleted once an operation completes; this is because all the files will have been uploaded to the cloud.

    The staging directory can become quite large during a normal backup depending on the number of documents being backed up, and the size of their keys. See the Disk requirements section for more information about how to provision the staging directory.

    Configuring cloud backups

    The first step is to create a backup archive in object store. This can be done with the config command and only needs to be done once. All other commands will automatically download the archive metadata into the directory provided via the obj-staging-dir argument prior to performing any operations; this is done regardless of whether the archive exists locally because we must ensure the archive in the staging directory is up to date. Below is an example of how you would configure an archive in AWS S3.

    $ cbbackupmgr config -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging
    Backup repository `repo` created successfully in archive `s3://bucket/archive`

    Assuming your credentials are correct, the archive should now reside directly in the provided S3 bucket. To verify, you could use the aws-cli to list the contents of the bucket; they should match the contents of a local backup.

    Although it’s possible to have cbbackupmgr coexist in the same S3 bucket as other general purpose storage, we recommend using a bucket to which cbbackupmgr has exclusive access.

    Backing up a cluster

    Once an archive is configured, performing a backup works in a similar fashion to performing a local backup. It’s important to note that when backing up directly to S3 a certain amount of disk space will be used to stage local metadata files and storage indexes. See The staging directory section for more information. Below is an example of doing a backup and storing it directly in AWS S3.

    $ cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
      -c http://10.101.101.112:8091 -u Administrator -p password
    Copied all data in 1m2.525397237s (Avg. 51.61KB/Sec)      7303 items / 3.12MB
    beer-sample             [===========================================] 100.00%
    Backup successfully completed
    Backed up bucket "beer-sample" succeeded
    Mutations backedup; 7303, Mutations failed to backup: 0
    Deletions backedup: 0, Deletions failed to backup: 0

    Performing incremental backups works exactly as it would if you were performing an incremental backup locally; simply rerun the command above and an incremental backup will be created.

    When choosing the number of threads to use it’s important to consider that when backing up to the cloud, cbbackupmgr buffers data in memory before uploading it. This means that choosing an extremely large number of threads when using a poor internet connection could lead to a scenario where your machine runs out of memory.

    To learn more about backup options see cbbackupmgr-backup.

    Restoring a backup / multiple incremental backups

    Once you have created a backup, restoring it works in a similar way to restoring a local backup. It’s worth noting that restoring a backup to a cluster that’s hosted outside of AWS is likely to be significantly more expensive than performing a backup (depending on the size of your dataset). See Costing for more information.

    Below is an example of restoring a backup that is stored in AWS S3.

    $ cbbackupmgr restore -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
      -c http://10.101.101.112:8091 -u Administrator -p password
    (1/1) Restoring backup 2020-03-19T15_35_00.467487218Z
    Copied all data in 28.048019272s (Avg. 103.21KB/Sec)     7303 items / 2.82MB
    beer-sample             [==========================================] 100.00%
    Restore bucket 'beer-sample' succeeded
    Mutations restored: 7303, Mutations failed to restore: 0
    Deletions restored: 0, Deletions failed to restore: 0
    Skipped due to purge number or conflict resolution: Mutations: 0 Deletions: 0
    Restore completed successfully

    Disk requirements

    As discussed in Staging Directory, you will be required to provision enough disk space to store all the keys for your dataset on disk during a backup. We would recommend doing a simple calculation to determine the approximate size of the staging directory.

    Using the formula below, you can calculate the approximate size of the staging directory in Gigabytes:

    `(${NUMBER_OF_ITEMS} * (${AVERAGE_KEY_SIZE_IN_BYTES} + 30)) / (1024 ^ 3)`

    Note that this is a rough estimate which doesn’t account for factors such as fragmentation; however, it should be a good starting point. Using this formula and given a dataset with 50 million keys with an average size of 75 bytes, we’d expect to need to provision about 5GB of disk space.

    When approximating the size of the staging directory, we don’t need to account for the size of the document values because they are never stored on disk; they are uploaded directly to object store.
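
    The formula above can be evaluated with a short shell helper. This is a sketch only: the function name estimate_staging_size is hypothetical, and awk is assumed to be available for the floating-point arithmetic.

```shell
# Estimate the staging directory size using the formula above.
#   $1: number of items
#   $2: average key size in bytes
estimate_staging_size() {
  awk -v items="$1" -v key_size="$2" \
    'BEGIN { printf "%.2f\n", (items * (key_size + 30)) / (1024 ^ 3) }'
}

# 50 million keys with an average key size of 75 bytes.
estimate_staging_size 50000000 75   # prints 4.89
```

    The result for the example dataset agrees with the figure of about 5GB given above.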

    Costing

    Before using the backup/restore/examine sub-commands it’s worth ensuring that you understand the costing as related to AWS S3. We recommend that you calculate how much it will cost to upload/download your dataset using the AWS S3 calculator at https://calculator.s3.amazonaws.com/index.html.

    Backup

    Backing up data from outside/inside AWS S3 is cheap; this is because at the time of writing, it doesn’t cost anything to transfer data into S3 (you only pay for the storage/requests).

    Restore

    Restoring data is another matter; AWS S3 charges users for pulling data from AWS onto the internet. This means that restoring large datasets can become quite costly if your cluster is not in AWS. Before performing a restore, use info (as described below in Interrogating backups) to determine the size of your backup. You can then use this to calculate how much it will cost to restore your backup.
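
    As a rough sketch of that calculation, the helper below multiplies the backup size by a per-GB transfer price. The function name estimate_restore_cost is hypothetical, and the price must be supplied by you from the AWS calculator; the sketch ignores request charges and any tiered pricing.

```shell
# Rough data-transfer cost estimate: size multiplied by a per-GB price.
#   $1: backup size in GB (for example, as reported by info)
#   $2: price per GB transferred out; take this from the AWS pricing
#       calculator, there is deliberately no default
estimate_restore_cost() {
  awk -v size_gb="$1" -v price="$2" \
    'BEGIN { printf "%.2f\n", size_gb * price }'
}
```

    For example, estimate_restore_cost 100 0.09 prints 9.00, where 0.09 is a made-up placeholder price rather than a quoted AWS rate.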

    At the time of writing, restoring a backup to a cluster hosted inside AWS will not be significantly costly since AWS does not charge for bandwidth inside AWS. Whether your cluster is hosted inside or outside AWS, it’s worth calculating the costs before performing a restore.

    Merging

    One of the main reasons for merging incremental backups is to save disk space. In AWS S3 space is cheap and bandwidth (to the broader internet) is expensive. This means that there isn’t a financially viable reason for merging cloud backups. For this reason merging incremental backups stored in the cloud is not supported.

    Restoring will continue to support applying incremental backups in chronological order in the same fashion that it would when merging, i.e. you will end up with the same data in your Couchbase cluster.

    Interrogating backups

    Several tools have been made available for use with archives stored directly in the cloud; currently these are:

    Examine

    Examine can be used to query whether a document with the given key exists in the given bucket (possibly across multiple backups). The examine subcommand supports directly querying the data in S3.

    $ cbbackupmgr examine -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
      --bucket beer-sample --key '21st_amendment_brewery_cafe'
    Key: 21st_amendment_brewery_cafe
    SeqNo: 5
    Backup: 2020-01-08T17_21_20.232087665Z
    Deleted: false
    Size: 27B (key), 29B (meta), 666B (value)
    Meta: {"flags":33554432,"cas":1578502228728479744,"revseqno":1,"datatype":1}
    Value: {"address":["563 Second Street"],"city":"San Francisco","code":"94107","country":"United States","description":"The 21st Amendment Brewery offers a variety of award winning house made brews and American grilled cuisine in a comfortable loft like setting. Join us before and after Giants baseball games in our outdoor beer garden. A great location for functions and parties in our semi-private Brewers Loft. See you soon at the 21A!","geo":{"accuracy":"ROOFTOP","lat":37.7825,"lon":-122.393},"name":"21st Amendment Brewery Cafe","phone":"1-415-369-0900","state":"California","type":"brewery","updated":"2010-10-24 13:54:07","website":"http://www.21st-amendment.com/"}

    To learn more about examine options see cbbackupmgr-examine.

    Info

    The info command can be used to query a broader archive to understand its structure and to gain an understanding of what data is backed up and where.

    $ cbbackupmgr info -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging
    Name  | Size  | # Backups  |
    rift  | 796B  | 1          |
    +  Backup                          | Size  | Type  | Source                  | Range  | Events  | Aliases  | Complete  |
    +  2020-01-08T17_21_20.232087665Z  | 796B  | FULL  | http://172.20.1.1:8091  | N/A    | 0       | 0        | true      |
    -    Bucket       | Size  | Items  | Mutations  | Tombstones  | Views  | FTS  | Indexes  | CBAS  |
    -    beer-sample  | 790B  | 7303   | 7303       | 0           | 1      | 0    | 0        | 0     |

    To learn more about info options see cbbackupmgr-info.

    List

    Listing backups stored in AWS is not supported; we suggest that you use the newer alternative, info, which shows the same information with additional detail.

    Archive Locking

    It’s important that only one instance of cbbackupmgr has access to the archive at a time; this is enforced using a lockfile, meaning most of the time you shouldn’t need to worry about this. However, there are some situations where cbbackupmgr may fail to ensure exclusive access to the archive:

    1. Another process (on another machine, or the local machine) already has an active lockfile.

    2. A stale lockfile exists which belongs to a system with a different hostname.

    In cases where cbbackupmgr fails to lock an archive, a few simple steps can be taken:

    1. Manually ensure that nobody else is using the archive.

    2. If you are certain nobody else is using the archive, locate the lockfile in S3 (it has the format lock-${UUID}.lk and is stored in the top-level of the archive).

    3. Remove the lockfile and try to continue using the archive with your own instance of cbbackupmgr.

    It’s extremely important that you only manually remove the lockfile if you are certain that there isn’t another instance of cbbackupmgr using the archive. Having two instances of cbbackupmgr running against the same archive could cause data loss through overlapping key prefixes.

    Below is an example of an archive which contains a lockfile from a system that crashed where the lockfile was never cleaned up.

    $ aws s3 ls s3://backups --recursive
    2020-04-27 09:34:10        120 archive/.backup
    2020-04-27 09:34:23         34 archive/lock-14eb923b-60a7-480a-849e-8af48e47f9ea.lk
    2020-04-27 09:34:10        520 archive/logs/backup-0.log
    2020-04-27 09:34:10        651 archive/repo/backup-meta.json

    If we attempt to use cbbackupmgr to create a backup, we should see a message similar to the one below:

    $ cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
      -c 172.20.1.1:8091 -u admin -p password
    Error backing up cluster: the process '{PID}' running on '{HOSTNAME}' already holds the lock

    In this case, cbbackupmgr will not remove the lock automatically since it cannot safely determine whether the other process is active or not. We can use the information about which machine the other instance of cbbackupmgr is running on to check whether it is active. If this machine has crashed and that instance of cbbackupmgr is no longer using the archive, we can manually remove the lockfile.

    $ aws s3 rm s3://backups/archive/lock-14eb923b-60a7-480a-849e-8af48e47f9ea.lk
    delete: s3://backups/archive/lock-14eb923b-60a7-480a-849e-8af48e47f9ea.lk

    If we attempt to perform the backup once again we will see that it continues successfully; in the case that the other machine failed during a backup you may be asked to purge the previous backup using the --purge flag before you can create a new backup:

    $ cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
      -c 172.20.1.1:8091 -u admin -p password
    Copied all data in 1m2.525397237s (Avg. 51.61KB/Sec)      7303 items / 3.12MB
    beer-sample             [===========================================] 100.00%
    Backup successfully completed
    Backed up bucket "beer-sample" succeeded
    Mutations backedup; 7303, Mutations failed to backup: 0
    Deletions backedup: 0, Deletions failed to backup: 0

    Compatible Object Stores

    cbbackupmgr is tested against the supported cloud providers; however, in some cases it will also work with compatible object stores e.g. Localstack/Scality. It’s important to note that your experience may differ when interacting with compatible object stores because some have slightly different behaviors which cbbackupmgr may not explicitly handle.

    AWS

    It should be possible to use cbbackupmgr with S3 compatible object stores; however, there are some things that need to be taken into consideration. First and foremost are the features that cbbackupmgr leverages. Below is a list of S3 API features that cbbackupmgr uses but not all compatible object stores support:

    It’s important that you check whether these features are implemented on your S3 compatible object store because without them cbbackupmgr will not work as expected.

    AWS also has a slightly newer virtual addressing style, the documentation for which can be found at https://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html. Not all S3 compatible object stores support this style of addressing. The errors that are returned by the SDK (and therefore cbbackupmgr) in these cases are not always clear. Before raising a support ticket about cbbackupmgr not working with an S3 compatible object store you should first try using the --s3-force-path-style argument. This will force cbbackupmgr to use the old path style addressing. From our testing with S3 compatible object stores it’s very common for this flag to be required.

    Cloud Provider Specific Features

    As stated above in the 'Compatible Object Stores' section it’s possible to use cbbackupmgr with other providers which expose an S3 compatible API. It’s important to note that some features may only be accessible to those using AWS.

    AWS

    When running cbbackupmgr in an AWS instance, it may use the EC2 instance metadata to get credentials. This is disabled by default but can be enabled by setting the CB_AWS_ENABLE_EC2_METADATA environment variable to true.

    For example, if we wanted to use cbbackupmgr with the EC2 instance metadata we would:

    1. Create a role with a policy which allows S3 data manipulation (e.g. S3 Full Admin).

    2. Attach that role to the instance.

    3. Run export CB_AWS_ENABLE_EC2_METADATA=true to enable fetching EC2 instance metadata.

    4. Run cbbackupmgr as described elsewhere in this tutorial.
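
    The shell side of these steps (the export and the subsequent run) can be sketched as follows. It assumes the role has already been created and attached to the instance; the cbbackupmgr invocation is left as a comment and reuses the archive, repository and cluster values from the earlier examples in this tutorial.

```shell
# Opt in to fetching credentials from the EC2 instance metadata.
export CB_AWS_ENABLE_EC2_METADATA=true

# Then run cbbackupmgr as usual from the same shell, for example:
#   cbbackupmgr backup -a s3://bucket/archive -r repo --obj-staging-dir /mnt/staging \
#     -c http://10.101.101.112:8091 -u Administrator -p password
```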

    CBBACKUPMGR

    Part of the cbbackupmgr suite