This month’s T-SQL Tuesday is about keeping a healthy SQL environment. You can read more in Pearl Knows’ post on it. One of the things I’ve done in our environment is automate restores to a test machine to validate that we can successfully restore our backups. I’ve blogged about pieces of that script and will likely do more of that as it evolves, but what I want to write about today is the statistics that have come out of that routine.
The Restore Setup
To start with, our standard backup job drops a backup on a local disk, and then Tivoli, on a different schedule, runs and pulls it to its storage. The next morning the restore process picks five random databases, finds the latest backup for each by querying msdb, retrieves it from Tivoli to a test machine, does a SQL restore, and runs an integrity check. Any time it failed to restore, I noted the reason, resolved it, and finished the restore to make sure there were no other issues. Recording each failure’s cause also meant I could go back later and run metrics on why we couldn’t restore a backup.
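The core of that routine can be sketched in T-SQL. This isn’t our actual script — the database and file names are placeholders — but it shows the shape of the three steps: query msdb for the latest full backup, restore it on the test server, then run an integrity check.

```sql
-- Step 1: find the most recent full backup file for a database
-- by querying msdb. @DatabaseName is a placeholder; the real
-- routine picks five databases at random.
DECLARE @DatabaseName sysname = N'SomeDatabase';

SELECT TOP (1)
       bs.database_name,
       bs.backup_finish_date,
       bmf.physical_device_name      -- the file Tivoli should have picked up
FROM   msdb.dbo.backupset AS bs
JOIN   msdb.dbo.backupmediafamily AS bmf
       ON bmf.media_set_id = bs.media_set_id
WHERE  bs.database_name = @DatabaseName
  AND  bs.type = 'D'                 -- 'D' = full database backup
ORDER BY bs.backup_finish_date DESC;

-- Steps 2 and 3, on the test server, after Tivoli has delivered
-- the file. Paths and logical file names are illustrative.
RESTORE DATABASE SomeDatabase
FROM DISK = N'D:\RestoreTest\SomeDatabase.bak'
WITH MOVE N'SomeDatabase'     TO N'D:\RestoreTest\SomeDatabase.mdf',
     MOVE N'SomeDatabase_log' TO N'D:\RestoreTest\SomeDatabase_log.ldf',
     REPLACE, STATS = 10;

DBCC CHECKDB (N'SomeDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Querying msdb rather than listing files on disk matters here: it tells you what SQL Server believes it backed up, which is exactly what you want to compare against what long-term storage actually has.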
There was a 5.3% failure rate, excluding cases where the routine picked a database that was intentionally not backed up. This includes cases where the backup was sitting on disk but hadn’t been picked up by Tivoli before the restore routine started. I decided to count those as failures since, if we had a disk failure between when the backup happened (typically 8pm) and when the restore happened (at 5am), we would have lost data. This is a higher failure rate than we would like to see, so let’s look into what the failures were.
Breaking down the failures, 89.2% were due to Tivoli not having the backup file. By far the bulk of those were caused by scheduling differences between when the SQL backup ran and when the Tivoli backup ran. Some backups were missed entirely because the SQL backup and the Tivoli backup were running at the same time; others had a large gap between the two, sometimes almost a day. These scheduling issues accounted for half of all the failures. Fortunately, they were easy to fix by adjusting one schedule or the other.
21% of the Tivoli failures were due to transient Tivoli issues that caused backups not to happen (either system-wide or on one machine on a specific night), and 15% were caused by persistent configuration problems.
The remaining 10.8% of the failures were due to issues with the SQL backup job. One was on a system not yet in production that should already have had our maintenance job installed. One was caused by a permissions issue, and two were because the vendor had responsibility for the system and hadn’t set up a maintenance job. Yes, I check backups on systems I’m not responsible for. Mainly this is because I want to protect the hospital I work for. It is also in my best interest to raise the issue ahead of time instead of being pulled into a disaster and asked to perform a miracle.
We didn’t have a single backup fail to restore due to backup corruption. If we could get to the backup file, we were able to restore it.
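A full restore is the only real proof a backup is good, but if you can’t restore every backup every night, SQL Server does offer lighter-weight corruption checks you can layer in between restore tests. This is a general sketch, not part of our routine, and the names and paths are illustrative:

```sql
-- Validate pages as the backup is written; a damaged page
-- fails the backup instead of silently producing a bad file.
BACKUP DATABASE SomeDatabase
TO DISK = N'D:\Backups\SomeDatabase.bak'
WITH CHECKSUM, INIT;

-- Re-validate the backup file's checksums without actually
-- restoring it. Cheaper than a restore, though it proves less.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\SomeDatabase.bak'
WITH CHECKSUM;
```

Neither replaces a test restore plus DBCC CHECKDB — VERIFYONLY can pass on a backup of an already-corrupt database — but they catch media problems earlier in the chain.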
Being able to restore a backup is the most important part of having one. If you don’t know whether your backups are restorable, you won’t find the issues preventing a restore until it’s too late. The biggest lessons for us weren’t a surprise, but it’s good to have hard numbers behind them:
- Our biggest point of failure is getting the backup from local disk into long-term storage. If I had grabbed the SQL backup off disk instead of from long-term storage, I wouldn’t have caught the majority of our failures.
- Scheduling is the single biggest issue. If you have a similar setup, pay careful attention to your schedules, or see if you can add a step to your backup job that has the disk backup pick up the SQL backup right after it happens.
- Verify vendor work. There are vendors out there that insist on supporting their own systems and are good at it, but the bulk of the ones we work with aren’t. It may be documented that they have responsibility for a system, but at the end of the day, if it’s in your environment, you will probably be called on for support.
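Checking up on backups you don’t own — vendor-managed systems included — doesn’t have to be manual. A simple sketch of the kind of query that flags the problem, assuming a 24-hour backup window (adjust the threshold to your own schedule):

```sql
-- Flag every online user database whose last full backup is
-- missing from msdb or older than 24 hours.
SELECT d.name,
       MAX(bs.backup_finish_date) AS last_full_backup
FROM   sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
       ON bs.database_name = d.name
      AND bs.type = 'D'              -- full backups only
WHERE  d.name <> N'tempdb'           -- tempdb is never backed up
  AND  d.state_desc = 'ONLINE'
GROUP BY d.name
HAVING MAX(bs.backup_finish_date) IS NULL
    OR MAX(bs.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());
```

Run against a vendor’s server, a query like this turns “I assume they back it up” into a list you can take to the vendor before a disaster, not after.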
Tracking numbers about failures will also help you get the support you need from management to improve your infrastructure. In our case, we’re moving most of our backups from Tivoli to a Data Domain device so we can bypass those scheduling and agent failures.