If you ask a system administrator what their worst nightmare is, you’ll get a variety of answers, but high on the list will be the moment they realise the “rm” command they just ran is deleting the wrong data. That’s essentially what happened to an unfortunate engineer at GitLab, resulting in downtime and the loss of production data.
The problem was compounded by a lack of good backups, which made it impossible for GitLab to recover quickly and meant some of the accidentally deleted data was gone forever.
GitLab is a SaaS version control platform much like GitHub. Although it’s nowhere near as popular, GitLab has some useful features that GitHub lacks, including nicer issue tracking, which are driving increasing adoption among open source projects. As you can imagine, losing data is just about the worst thing that can happen to a version control platform.
You can read the full incident post-mortem on GitLab’s blog, but in a nutshell, under heavy load from suspected spammers, replication of GitLab’s primary database to a redundant secondary was failing. Engineers attempted to run the replication process manually, but it repeatedly failed. Suspecting that replication was failing because of stale files on the secondary server, the engineer decided to delete those files. Unfortunately, the “rm” command was run on the primary server rather than the secondary, and many gigabytes of production data were lost before the engineer realised his mistake.
It’s worth mentioning that GitLab’s handling of the incident was a paragon of transparency and openness. The company immediately began communicating with users about the cause of the downtime and kept communicating throughout the incident.
Less impressive was the disaster recovery plan that GitLab had in place. Deleting a massive chunk of production data is a very bad thing, but it needn’t be catastrophic. With proper backups in place, recovery should be a matter of restoring the most recent backup to the production server. In this case, when GitLab’s engineers looked in the S3 bucket that was supposed to contain their database backups, the cupboard was bare. The backup scripts relied on a database dump tool whose version was incompatible with the database running in production, so the backups failed silently.
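Silent failure is the crux here: a dump tool that errors out, or writes nothing, should trigger an alert rather than pass unnoticed. A minimal sketch of a “fail loudly” backup wrapper is below; the file path and the idea of wrapping an arbitrary dump command are illustrative assumptions, not GitLab’s actual tooling.

```shell
# Hypothetical wrapper that runs a dump command and refuses to fail silently.
# Usage: backup_and_verify <output-file> <dump-command> [args...]
backup_and_verify() {
  dumpfile="$1"
  shift

  # Run the dump command, writing its output to the target file.
  "$@" > "$dumpfile"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "BACKUP FAILED: dump command exited with status $status" >&2
    return 1
  fi

  # A zero-byte dump means something went wrong even if the tool exited 0.
  if [ ! -s "$dumpfile" ]; then
    echo "BACKUP FAILED: dump file $dumpfile is empty" >&2
    return 1
  fi

  echo "BACKUP OK: $dumpfile ($(wc -c < "$dumpfile") bytes)"
}
```

In a real deployment the non-zero return code would feed an alerting system (a cron mail, a pager, a monitoring check), so a version mismatch like GitLab’s would surface on the first failed run instead of on the day the backup is needed.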
We all make mistakes. I doubt there’s a system administrator or IT professional reading this who hasn’t accidentally deleted the wrong thing. It happens. But, knowing that it happens, processes should be put in place to ensure that recovery is straightforward. It’s great that GitLab had backup processes, but an unverified backup is worthless. Regular backup verification should be part of every company’s disaster recovery and business continuity plan.
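One lightweight form of the verification argued for above is a scheduled job that checks the newest backup actually exists, is recent, and is a plausible size, and fails noisily otherwise. The sketch below assumes a directory of dump files and made-up thresholds; a fuller verification would also restore the dump into a scratch database and run sanity queries.

```shell
# Hypothetical backup sanity check, intended to run from cron.
# Directory layout and thresholds are illustrative assumptions.
verify_latest_backup() {
  dir="$1"          # directory holding dump files
  min_bytes="$2"    # smallest plausible dump size
  max_age_min="$3"  # maximum acceptable age, in minutes

  # Newest file in the backup directory (empty string if none exist).
  latest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  if [ -z "$latest" ]; then
    echo "VERIFY FAILED: no backups found in $dir" >&2
    return 1
  fi

  size=$(wc -c < "$dir/$latest")
  if [ "$size" -lt "$min_bytes" ]; then
    echo "VERIFY FAILED: $latest is only $size bytes" >&2
    return 1
  fi

  # find prints the file only if it is OLDER than max_age_min minutes.
  if [ -n "$(find "$dir/$latest" -mmin +"$max_age_min")" ]; then
    echo "VERIFY FAILED: $latest is older than $max_age_min minutes" >&2
    return 1
  fi

  echo "VERIFY OK: $latest ($size bytes)"
}
```

Had a check like this been paging someone, GitLab’s empty S3 bucket would have been a routine alert weeks earlier rather than a discovery made mid-incident.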