How to Back Up a MongoDB Sharded Cluster using File System Snapshots

Onyancha Brian Henry
23 February 2021

Taking a backup of a MongoDB sharded cluster is not as straightforward as backing up a standalone instance. This blog shows how file system snapshots can be used to back up and restore a MongoDB sharded cluster.

Considerations Before Taking a Sharded Cluster Backup

Before making a backup, here are some points one should take into consideration.

  1. Balancer: as we know, data is distributed through a number of shards in the form of chunks that migrate according to the shard key configuration. It is recommended to stop the balancer before making a backup, because while chunks are migrating during the backup, the resulting artifacts may contain duplicated or incomplete data, compromising the backup’s integrity.
  2. Data Consistency: the journal can be essential in making a point-in-time snapshot, but you have to make some considerations in relation to the data files. If the data files and the journal are hosted on the same logical volume, a single point-in-time snapshot can capture a consistent copy of the data files. If the data files are on a separate logical volume from the journal, one must use db.fsyncLock() to flush all pending write operations and lock the mongod instance. This prevents additional writes that would otherwise lead to data inconsistency, ensuring the data files are safe to copy. After that, one can use db.fsyncUnlock() to remove the lock and resume normal database operations (a sketch of this pattern follows this list).
  3. Exact Moment-in-Time Precision: data at the exact time the backup is made may be required, and in that case all application writes must be stopped; otherwise, the result is only an approximated moment-in-time backup.
    If an approximate point-in-time snapshot is sufficient, you can back up from a secondary member of each shard’s replica set to reduce the impact of making the backup.
  4. Encrypted Storage Engine for MongoDB Enterprise

If the encrypted storage engine uses the AES256-GCM cipher, a unique counter block value must be used with the key for every process.

  • If you are going to restore from a hot backup (taken while mongod is running), MongoDB 4.2+ will detect “dirty” keys on startup and automatically roll over the database key to avoid Initialization Vector (IV) reuse.
  • In the case of a cold backup (taken while mongod is not running), the restore process does not direct the database to detect “dirty” keys on startup, and reuse of the Initialization Vector invalidates the confidentiality and integrity guarantees. With MongoDB 4.2+, you can avoid this by starting mongod with the --eseDatabaseKeyRollover option, which rolls over the database keys configured with the AES256-GCM cipher.

    For deployments depending on Amazon Elastic Block Store (EBS) with a RAID 10 configuration, there are two alternatives for getting a consistent state across all disks using the platform’s snapshot tool:
  • Flush all writes to disk and establish a write lock to ensure a consistent state during the backup process. If this procedure is chosen, back up the instance with the journal files on a separate volume or without journaling.
  • Alternatively, configure LVM on top of the RAID configuration to run and hold the MongoDB data files, and then create a snapshot of the LVM volume.
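
As a minimal sketch of the locking pattern from consideration 2 above, the following mongo shell commands flush pending writes and lock the mongod instance before the snapshot is taken, then release the lock afterwards. The snapshot itself is created at the operating system or platform level while the lock is held:

// Connect to the mongod instance whose volume will be snapshotted.
// Flush all pending writes to disk and block further writes:
db.fsyncLock()

// ... create the file system or EBS snapshot here, outside the shell ...

// Release the lock so normal write operations can resume:
db.fsyncUnlock()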

Steps to Back Up a MongoDB Sharded Cluster

  1. Disable the balancer.

Connect to the cluster’s mongos through a mongo shell and disable the balancer with the commands:

use config
sh.stopBalancer()

If a balancing round is in progress, the operation waits for the balancing to complete before stopping the balancer; otherwise, the balancer is stopped immediately.

With MongoDB 4.2+, the sh.stopBalancer() command also disables sharded cluster auto-splitting.
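
Before proceeding, it may be worth confirming that the balancer has actually stopped. A minimal check from the same mongos session, using the sh.getBalancerState() and sh.isBalancerRunning() shell helpers as they behave in 4.2/4.4-era shells:

use config

// Confirm the balancer is disabled:
sh.getBalancerState()     // should return false

// Wait for any in-progress migration round to finish:
while (sh.isBalancerRunning()) {
   print("waiting for the balancer to stop...");
   sleep(1000);
}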

  2. Lock a secondary member of each replica set.
    This step is necessary only if journaling is not enabled on the secondary or if the data files are on a separate logical volume from the journal.
    If this is the case, one must lock one secondary member of each shard, as well as a secondary of the config server replica set.

If the backup is going to take a long time, you must ensure the oplog is large enough to record the changes happening on the primary, since these will be needed to bring the secondaries up to date once the backup completes. To view the oplog status, use:

rs.printReplicationInfo()

Before taking the lock, ensure that the secondary member you are locking has replicated data past a known point. This can be verified by performing a write operation with “majority” write concern against a control collection and then checking for it on the secondary, e.g.:

use config
db.BackupControl.findAndModify(
   {
     query: { _id: 'BackupControlDocument' },
     update: { $inc: { counter : 1 } },
     new: true,
     upsert: true,
     writeConcern: { w: 'majority', wtimeout: 15000 }
   }
);

This operation should return:

{ "_id" : "BackupControlDocument", "counter" : 1 }

For each replica set, lock only the secondary member that contains the latest control document.
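
One way to find that member is to query the control collection on each secondary directly. A sketch, assuming a 4.4-era shell (use rs.slaveOk() on older shells):

// Connect to each candidate secondary and allow reads on it:
rs.secondaryOk()

use config
// The member returning the freshest control document is the one to lock:
db.BackupControl.find({ _id: 'BackupControlDocument' })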

The command for locking is db.fsyncLock().

The same procedure applies when locking the config server replica set secondary, as sketched below.
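
Putting this together, a minimal locking pass might look like the following, run on the chosen secondary of each shard and on the CSRS secondary; db.currentOp() reflects the lock state and can serve as a sanity check:

// Flush writes to disk and lock the member against further writes:
db.fsyncLock()

// Optional sanity check: the fsyncLock field is true while the lock is held
db.currentOp().fsyncLock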

  3. Back up one of the config servers.
    This procedure lets us capture the cluster metadata as part of the backup, and it should be performed against the locked CSRS secondary member.

  4. For each shard, back up a replica set member.
    This secondary member has to be the one that was locked. The shards can be backed up in parallel.

  5. Unlock all locked replica set members.
    Use db.fsyncUnlock() to unlock all the replica set members and config server members that were locked.
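
A minimal unlock pass, run on every member that was locked; note that db.fsyncLock() calls are cumulative, so each lock requires a matching unlock:

// On each locked shard secondary and the locked CSRS secondary:
db.fsyncUnlock()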

  6. Enable the balancer.

    sh.startBalancer()

    With MongoDB 4.2+, sh.startBalancer() also re-enables auto-splitting for the sharded cluster.
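
As a quick sanity check, the same helper used earlier can confirm the balancer is enabled again:

sh.getBalancerState()     // should now return true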

Summary

Making a backup of a sharded cluster is not as hard a task as it may seem. The core thing to take into consideration is the consistency of the data captured at the time the backup is made, which is achieved by locking the members that are being backed up.
