Automated Backup Restore Validation

As we do in Flipkart

Isha Aggarwal
Flipkart Tech Blog

In the previous blog on how Backup and Restore as a Service or BRaaS is provided within Flipkart, we talked about why it is essential for an organization like Flipkart to have safeguards in place to protect its data against any kind of failures and disasters by storing data backups in a secured off-site repository.

In this blog, we’ll explore Restore as a Service, focussing on Automated Backup Restore Validation. We’ll discuss why it’s important, especially from Flipkart’s point of view. Let’s first get to know some important terms that will help you better understand this blog.

RPO (Recovery Point Objective): The maximum acceptable data loss during a disruption.
RTO (Recovery Time Objective): The maximum allowable downtime for restoring a network or application.
BCP (Business Continuity Plan): The strategy to keep a business running at an acceptable level during disruptions.
DR (Disaster Recovery): Part of a BCP, it’s about recovering technology systems and applications after disasters or outages.

From Flipkart’s perspective, a BCP is a vital strategy for ensuring the uninterrupted operation of our critical business functions, even amid disasters such as natural calamities or cyber-attacks. Our framework covers all facets of the organization, including our workforce, communication channels, IT systems, business partners, and stakeholders. Within this framework, we outline specific measures and define how our systems should behave so that we respond effectively when disaster strikes.

Our disaster recovery plan focuses on:

  • Restoring essential applications and data swiftly following any disaster.
  • Minimizing both the RPO and RTO during disaster recovery.
  • Reducing downtime and data loss to a minimum.

BRaaS is responsible for taking over 7,000 backups daily from over 3,000 virtual machines, securing over 400 terabytes of data every day. However, all these backups would be pointless if customers couldn’t easily restore their systems during a disaster.

Now here’s the big question — How can we ensure that this backed-up data will actually work when needed to get our systems back on track during a crisis? This is where Automated Backup Restore Validation steps in. It’s a crucial process to make sure that your backed-up data can be successfully restored when things go wrong. Let’s dig deeper into why Backup Restore Validation is so important.

Backup Restore Validation from SOX perspective

Backup Restore Validation plays a major role in meeting SOX compliance requirements, which revolve around four key elements: access control, change management, data backup, and IT security. The data backup component stipulates that systems must be in place to back up all financial records and other sensitive data, both onsite and offsite, using appropriate storage systems, and that these backups should be periodically tested for restorability.

The aim of this provision is to guarantee that data backups and restores serve as a safeguard against potential loss and damage in the event of a disaster. Accordingly, we built a feature that validates whether our very large volume of backups can actually be restored if a disaster occurs.

At Flipkart, around 70% of the 1900+ database clusters onboarded to BRaaS use MySQL as their datastore. Hence, we began by building Automated Restore Validation for these MySQL clusters using our in-house managed MySQL service, Altair. We leveraged Altair’s proficiency in managing and maintaining High Availability MySQL clusters to carry out restore validations. We plan to extend this feature to other managed databases in the future.

Design

We perform Automated Restore Validation in the following steps:

  1. Get configs to perform restore validation upon
    We have deployed a Kubernetes (K8s) CronJob that runs twice a day and picks the first five configs that have not been restore-validated in the last 6 months.
  2. Fetch backup details for the config
    For each config to be validated, the BRaaS service fetches its latest full backup and sends this information to the managed DB service, which creates a new cluster for this config.
  3. Create DB cluster for each config
    The managed DB service:
    a) Creates a new cluster.
    b) Installs relevant packages required for restoration.
    c) Restores the backup from cloud into this newly created cluster, using the information received from BRaaS service.
  4. Poll till the new cluster reaches terminal state
    The BRaaS service keeps polling the managed DB service for the restore status, up to a maximum wait time that is set separately for each automated restore validation run. When the response carries a terminal status, the restore is marked as completed.
  5. Update validation details of the cluster
    Update the validation status and completion time in BRaaS DB.
  6. Destroy the created restore validation cluster
    The test cluster is destroyed after automated restore validation is concluded.
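
The six steps above can be sketched roughly as follows. This is a minimal illustration and not the actual BRaaS or Altair code: the `ManagedDbService` interface, the status names, the `BackupConfig` fields, and the choice to pick the longest-unvalidated configs first are all assumptions made for the example.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional, Protocol

# Hypothetical status names; the real managed DB service may use others.
TERMINAL_STATES = {"RESTORED", "FAILED"}


@dataclass
class BackupConfig:
    config_id: str
    validation_completed_at: Optional[datetime] = None
    validation_status: Optional[str] = None


class ManagedDbService(Protocol):
    """Hypothetical client interface for the managed DB service."""

    def create_cluster_from_backup(self, config_id: str, backup_id: str) -> str: ...
    def restore_status(self, cluster_id: str) -> str: ...
    def destroy_cluster(self, cluster_id: str) -> None: ...


def pick_configs(configs: list[BackupConfig], now: datetime,
                 batch_size: int = 5,
                 max_age: timedelta = timedelta(days=182)) -> list[BackupConfig]:
    """Step 1: pick up to batch_size configs not validated within max_age."""
    cutoff = now - max_age
    stale = [c for c in configs
             if c.validation_completed_at is None
             or c.validation_completed_at < cutoff]
    # Assumption: never-validated configs first, then the oldest validations.
    stale.sort(key=lambda c: (c.validation_completed_at is not None,
                              c.validation_completed_at or now))
    return stale[:batch_size]


def poll_until_terminal(db: ManagedDbService, cluster_id: str, timeout_s: float,
                        interval_s: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> str:
    """Step 4: poll the restore status until it is terminal or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = db.restore_status(cluster_id)
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            return "TIMED_OUT"
        sleep(interval_s)


def validate_config(config: BackupConfig,
                    latest_full_backup: Callable[[str], str],
                    db: ManagedDbService, timeout_s: float, now: datetime,
                    **poll_kwargs) -> str:
    """Steps 2 to 6 for a single config."""
    backup_id = latest_full_backup(config.config_id)                         # step 2
    cluster_id = db.create_cluster_from_backup(config.config_id, backup_id)  # step 3
    try:
        status = poll_until_terminal(db, cluster_id, timeout_s, **poll_kwargs)  # step 4
        config.validation_status = status                                    # step 5
        config.validation_completed_at = now
        return status
    finally:
        db.destroy_cluster(cluster_id)                                       # step 6
```

Destroying the cluster in a `finally` block means the test cluster is cleaned up even when the restore fails or times out. Whether a failed run should count as "validated" for the 6-month window is a policy choice this sketch does not try to settle.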

In our architecture, we have tried to optimize several key parameters to enhance efficiency and performance of the validation process. Let’s delve into the specifics:

1. Hardware/OS Dependency

  • Challenge: Dependency on underlying hardware and OS can limit flexibility and increase manual maintenance overhead.
  • Solution: Managed DB service manages database clusters, removing any underlying hardware or OS dependency.

2. Database Version Dependency

  • Challenge: Compatibility issues with different database versions.
  • Solution: Managed DB service maintains the DB clusters, ensuring version compatibility for system stability and maintainability.

3. Data Size and Disk Type

  • Challenge: Managing diverse data sizes and disk types.
  • Solution: New instances with SSD disks, supporting validation for data sizes up to 2200 GB.

4. Cost and Network Bandwidth Usage

  • Challenge: Determining optimal validation frequency as per SOX compliance, balancing cost, and network bandwidth.
  • Solution: In line with SOX compliance, we restore full backups annually, which keeps both cost and network bandwidth usage in check.
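
As a rough back-of-the-envelope check on the validation cadence, the figures quoted earlier in this post (1900+ clusters, around 70% on MySQL, five configs per run, two runs a day) can be combined as below. It assumes every run fills its batch and no config needs a retry, so treat the result as an estimate only.

```python
import math


def days_to_cover(cluster_count: int, configs_per_run: int, runs_per_day: int) -> int:
    """Days needed to validate every cluster once at a fixed daily rate."""
    per_day = configs_per_run * runs_per_day
    return math.ceil(cluster_count / per_day)


# Approximate inputs taken from the figures in this post.
mysql_clusters = 1900 * 70 // 100          # about 1330 MySQL clusters
days = days_to_cover(mysql_clusters, configs_per_run=5, runs_per_day=2)
print(days)  # 133
```

Under these assumptions, one full pass over the MySQL clusters takes about 133 days, which fits inside the 6-month selection window used by the cron job.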

What’s Next?

  1. Validation of other datastores: Currently, we run automated restore validation only for the MySQL datastore. We plan to extend it to other managed datastores within Flipkart, such as TiDB.
  2. Data Certification: Currently, as part of automated restore validation for MySQL clusters, we spawn a fresh MySQL cluster, restore the latest full backup into it, and start the MySQL service. If the MySQL service is up and running, we conclude that the backups are restorable in case of a disaster.
    However, this is only a schema-level validation of the backups. Our roadmap goes a step further: performing data certification by validating the restored data as well.

High level Design for Data Certification

The DB cluster owners know their clusters best, including which data must be validated to certify a restore. We will therefore ask the owners to supply the data validation queries, with a cap on their total execution time, because running these queries adds to the overall backup and restoration time.

Before taking the snapshot of the data, we take a read lock on the tables (MySQL’s FLUSH TABLES WITH READ LOCK) to freeze the database and ensure the snapshot is consistent. After that, the queries shared by the owners will be executed every time a full backup runs for that cluster. Instead of storing the output of the queries, which could vary in length and type, we will store a checksum of each output in the database.

When the same backup is restored during automated restore validation, we will run the same queries again on the restored cluster and compute the checksum of their output. We then compare this checksum with the one stored when the queries ran during the backup, to confirm that the restored data matches the backed-up data.
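
A minimal sketch of this checksum comparison is shown below. It is illustrative only: the function names, the use of SHA-256, and the `run_query` callable (standing in for whatever executes SQL against the source or restored cluster) are assumptions, not part of the planned design. It also assumes each owner query returns rows in a deterministic order (for example via ORDER BY) and that the same client driver is used at backup and restore time, so values serialize identically.

```python
from __future__ import annotations

import hashlib
from typing import Callable, Iterable, Sequence

Row = Sequence[object]
RunQuery = Callable[[str], Iterable[Row]]


def checksum_rows(rows: Iterable[Row]) -> str:
    """Return a SHA-256 hex digest over rows, in the order they are given."""
    digest = hashlib.sha256()
    for row in rows:
        # Prefix each row with its column count and each value with its
        # length, so ("ab", "c") and ("a", "bc") produce different digests.
        digest.update(len(row).to_bytes(4, "big"))
        for value in row:
            encoded = repr(value).encode("utf-8")
            digest.update(len(encoded).to_bytes(8, "big"))
            digest.update(encoded)
    return digest.hexdigest()


def checksums_for(queries: list[str], run_query: RunQuery) -> dict[str, str]:
    """Backup time: run each owner query and keep only its checksum."""
    return {q: checksum_rows(run_query(q)) for q in queries}


def certify_restore(queries: list[str], run_query: RunQuery,
                    stored: dict[str, str]) -> dict[str, bool]:
    """Restore time: rerun the queries and compare with stored checksums."""
    return {q: checksum_rows(run_query(q)) == stored.get(q) for q in queries}
```

Storing a fixed-length digest per query keeps the stored record the same size regardless of how large or varied the query output is, which is the motivation given above for storing checksums instead of raw results.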

Summary

In this tech blog, we’ve delved into the critical role of BRaaS in guaranteeing the functionality and restorability of backups for various MySQL clusters within Flipkart. We’ve explored the architecture for Automated Restore Validation, identified its present limitations, and detailed how we’ve effectively addressed these shortcomings in the new design, leveraging the capabilities of the managed MySQL platform.

Furthermore, we’ve provided insights into the exciting future roadmap for Automated Restore Validation to encompass additional datastores like TiDB and enhance the validation process, ensuring even greater data reliability and integrity.

In the world of data restoration, it’s not just about bits and bytes, but the magic of bringing your precious data back to life. With the right tools and a sprinkle of tech wizardry (read: BRaaS), your backups can go from zeros to heroes!

Thanks for reading :)
