by zytek 3404 days ago
To each of you guys having those extensive backup solutions (like NAS + cloud sync, second nas, etc)...

.. do you actually TEST those backups?

This question comes from my experience as a system engineer who found a critical bug in our MySQL backup solution that prevented the backups from being restored (inconsistent filesystem). Also, a friend of mine learned the hard way that his Backblaze backup was unrestorable.

12 comments

Very true. I overheard a similar conversation last week at work: "We have set up the backup procedure for our new production databases." - "Have you tested restore?" - "Well, uhm..." - sound of JIRA ticket being opened

By the way, I misread your username and, for a second, thought you were sytse.

I'd love to hear that sound :)
Excellent point, and that begs another question: how do you actually test your backups? Of course, each case is specific, but is there a "best practice" checklist, or some general points to check for basic restoration?
Excellent question. I do test my backups and restores on a rather constant basis. Each environment within my infrastructure takes a bit of a different approach.

Application

This is by far the easiest for me to test. We have a CI/CD job which literally makes a new environment, from scratch, and deploys our application to it in a production configuration. It runs a test suite which tests functionality across the application. Finally, it destroys the environment. It reports on each portion of the process. In this way we know exactly how long it would take to redeploy the entire application from scratch on a new infrastructure and get it up and running. This morning it took about 6 minutes total before tests ran.

Database

We are running an RDBMS. We use a combination of daily full backups, incremental transaction-log-style backups, and point-in-time backups. Again, in our CI/CD, when a full backup is taken it is pulled, loaded, and a test routine is run against it to check integrity. At this time, the recovery from the day before is destroyed. When a transaction log backup is made, CI/CD picks up this change, applies it to the full backup restore, and runs a set of integrity tests. This leaves us with a warm standby ready to be switched over to in case the main database server goes down. We have never had to use the warm standby in an emergency, but we have a test to make sure we can cut that over as well.

For point in time backup testing this goes back to our application test above. The application test will spin up with a point in time recovery of the database backup. It will test the integrity of that recovery and then test the application against it. Finally, it will swap from the point in time recovered database to the warm backup. It runs the test suite against that for integrity as well.

File Store

People often forget this, but those buckets that hold all of your file storage in the cloud can be destroyed so easily (sad, sad experience taught me this). We test those as well. I am sure you can guess at this point how we do that? CI/CD. It's a rather simple process with a ton of gain.

A few notes

People always ask me this, so I will answer it first. Yes this costs money. It's not as bad as running a second production environment. But it will cost you a bit. My follow up question is, how much does downtime cost you?

My CI/CD is always Gitlab CI at this point. I've used Jenkins. I've used Travis. I like Gitlab CI. You can do all of this with any of those.

We script literally everything. Computers are so good at repetitive tasks. Why would you EVER do anything manually? Really. If it has to do with your infrastructure, script it.

If anyone has any questions about these ideas, feel free to reach out.

How many (full time) devs and how long did that set up take?
We currently have 4 full time devs, a QA, a DBA consultant, and a Designer on the team.

Honestly, none of that took very long to set up at all. The application in this case is a Ruby on Rails backend, PostgreSQL database, Angular front end, with file storage and a few other smaller services.

Step one: We have a lot of tests and we believe in a good test suite. Are we perfect? Absolutely not. But it is important to be able to "know" the application works. Define what helps us to know it works, and automate tests to do that. Things like "Can you log in?", "Can you select a record of type X, Y, Z, A, B, and C and do those records have the data you would expect in the right places?" You can have a human do this, or you can automate it. Automate it.

Step two: Automate your deployment. The Rails application is bundled into a Docker container. We use ECS (Elastic Container Service) to maintain our environments. CI/CD first runs the tests, second builds the latest Docker container, third pushes the container to a repo, fourth deploys the container to the correct ECS environment, five, profit! This is all automated and works the same every time, with checks and balances along the way. Our Angular application runs out of S3 buckets with CloudFront caching. This was a matter of using webpack to compile the Angular application down to production-deployable artifacts and then a simple bash script to move those artifacts to the S3 bucket. The database is an RDS instance, so we get some fun things built in there. Note: All of the AWS setup is also automated with scripts. Create VPC, create autoscaling group, create targets, create RDS instances, create S3 buckets, create CloudFront, and delete all of the above (and more, AWS is complex) are all just scripts.

Step three: Because we have a test suite and deployment scripts the rest of the process is easy. Just use the scripts to create whatever environment we need, stick it on a schedule, record the results in CI/CD, alert the WHOLE FREAKING WORLD if something doesn't work.
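As a rough illustration of the "run it, record the results, alert if something doesn't work" part of step three, here is a minimal Python sketch. The names `run_checks` and `CheckResult` and the `alert` callback are hypothetical and not taken from the setup described above; the schedule itself is assumed to come from whatever scheduler the CI system provides.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class CheckResult:
    name: str       # label of the check, e.g. "restore"
    ok: bool        # True if the check finished without raising
    seconds: float  # how long the check took
    detail: str = ""  # repr of the exception when the check failed


def run_checks(
    checks: Iterable[Tuple[str, Callable[[], None]]],
    alert: Callable[[CheckResult], None],
) -> List[CheckResult]:
    """Run every check, time it, record the outcome, and alert on each failure."""
    results: List[CheckResult] = []
    for name, check in checks:
        started = time.monotonic()
        try:
            check()
            result = CheckResult(name, True, time.monotonic() - started)
        except Exception as exc:  # a failing check must not stop the others
            result = CheckResult(name, False, time.monotonic() - started, repr(exc))
        results.append(result)
        if not result.ok:
            alert(result)
    return results
```

Each check is just a function that raises when something is wrong, so a deploy test, a restore test, and a bucket test can all be plugged into the same runner. The recorded durations are what tell you how long a rebuild from scratch would take.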

Now you might say, easy to say in a Rails environment, with so few moving parts, with such a new project, etc etc etc (I have heard every excuse in the book). I have done this for many other companies. The last I did it at had about 50 engineers, ran a large Java mixed bag of applications on Tomcat servers, ran Oracle for their data, and had no tests and a ton of legacy code. We got to the same point as I have already explained by simply breaking it into chunks. First, automate the tests that were done manually. Second, automate the deployment steps that were done manually. Third, automate the environment things that were done manually. Finally, schedule everything and monitor.

I learned to do this at HP Labs where we used the same process with a very large API fronting a C based image processing system with Petabytes of storage, thousands of servers, and a huge number of moving parts. I promise, it can work anywhere.

I've been thinking about this as well - it seems like it would fit in nicely with other CI jobs. With database backups, for example, you should be able to script the restore procedure and apply some assertions to check it worked. Bonus with this is that you now have a script when you actually need to restore.
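A minimal sketch of that idea, in Python: restore the backup into a scratch location, then assert a few things about it. SQLite is used here only so the example is self-contained; it is a stand-in, not what anyone in this thread runs. The function names and the `expected_min_rows` parameter are assumptions for illustration, and a real setup would swap in its own database's restore tooling and its own assertions.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_backup(live_db: Path, backup_file: Path) -> None:
    """Copy a live SQLite database to backup_file using the online backup API."""
    src = sqlite3.connect(live_db)
    dst = sqlite3.connect(backup_file)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def verify_restore(backup_file: Path, expected_min_rows: dict) -> list:
    """Restore backup_file into a scratch database and return a list of problems.

    expected_min_rows maps table name -> minimum row count.
    An empty list means every assertion passed.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored.db"
        # "Restore" step: for SQLite this is a copy into a fresh location.
        restored.write_bytes(backup_file.read_bytes())
        conn = sqlite3.connect(restored)
        try:
            # Assertion 1: the engine itself considers the file consistent.
            try:
                status = conn.execute("PRAGMA integrity_check").fetchone()[0]
                if status != "ok":
                    problems.append(f"integrity_check: {status}")
            except sqlite3.DatabaseError as exc:
                problems.append(f"integrity_check: {exc}")
            # Assertion 2: the tables we care about exist and hold enough rows.
            for table, min_rows in expected_min_rows.items():
                try:
                    count = conn.execute(
                        f'SELECT COUNT(*) FROM "{table}"'
                    ).fetchone()[0]
                except sqlite3.DatabaseError as exc:
                    problems.append(f"{table}: {exc}")
                    continue
                if count < min_rows:
                    problems.append(f"{table}: {count} rows, expected >= {min_rows}")
        finally:
            conn.close()
    return problems
```

A CI job would fail the build whenever `verify_restore` returns a non-empty list. The restore step inside it is also the script you would reach for in a real emergency.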
I had a company that I was doing some work for come to me to ask for a copy of the database. Their backups were corrupt, and they did not find out until they tried to restore. By then they had 5 years of bad backups.
I have a script that checks for data rot[1]. It's part of a more comprehensive backup system[2].

[1] https://medium.com/vantage/how-to-protect-your-photos-from-b...

[2] https://medium.com/swlh/my-automated-photo-workflow-using-go...
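For readers who have not seen a data rot check before, the general technique is a checksum manifest: record a hash for every file once, then re-hash later and compare. The Python sketch below shows that idea only; it is not the script from the linked posts, and the function names and the JSON manifest format are assumptions. The manifest is assumed to live outside the tree being checked.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest: Path) -> None:
    """Record a checksum for every file under root."""
    sums = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    manifest.write_text(json.dumps(sums, indent=2))


def find_rot(root: Path, manifest: Path) -> dict:
    """Return {relative_path: "missing" or "changed"} for files that no longer match."""
    recorded = json.loads(manifest.read_text())
    damaged = {}
    for rel, expected in recorded.items():
        target = root / rel
        if not target.is_file():
            damaged[rel] = "missing"
        elif sha256_of(target) != expected:
            damaged[rel] = "changed"
    return damaged
```

A changed checksum cannot tell rot apart from a deliberate edit, so this works best for files that are never supposed to change, such as photo archives.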

I'm a big fan of setting up testing and dev environments from the production backups.

For personal backup of files, I just verify the results are in place. I've checked them once or twice, but honestly, I'm more concerned about my scripts silently stopping than about them running and producing incorrect results.
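One cheap way to catch the "script quietly stopped running" case is a freshness check: look at the newest file in the backup directory and complain if it is too old. A small Python sketch follows; the function names, the glob `pattern` parameter, and the age threshold are all assumptions for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(
    backup_dir: Path, pattern: str = "*", now: Optional[float] = None
) -> Optional[float]:
    """Age in hours of the most recently modified matching file, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in backup_dir.glob(pattern) if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(
    backup_dir: Path,
    max_age_hours: float,
    pattern: str = "*",
    now: Optional[float] = None,
) -> bool:
    """True if there is no backup at all, or the newest one is older than max_age_hours."""
    age = newest_backup_age_hours(backup_dir, pattern, now)
    return age is None or age > max_age_hours
```

This only proves that something was written recently, not that it can be restored, so it complements a restore test rather than replacing one.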

That is fine, providing you don't operate in a confidential or regulated environment. :/
Yes, many warnings do apply.
it's not a backup till you test it - just complicated wishful thinking.
As a colleague of mine says: You don't want a backup, you want a restore.
I'm using CrashPlan and I have recovered multiple files over the past couple of years that I had either mistakenly deleted or overwritten. I haven't tried a full-scale restore yet, though.
CrashPlan lost some data of mine in 2013 from querying a corrupted Volume Shadow Copy Services database on Windows. (At least, that was their explanation. I'm surprised that their client did not independently verify the data after it was uploaded.)

I moved off CrashPlan in 2016 because their upload speed continues to be embarrassingly slow outside the US even with deduplication and compression turned off (they have a datacentre where I'm at, but it's for Enterprise customers only).

They also highly recommend having 1GB of RAM for every 1TB backed up, which sounded a bit unreasonable to me.

What did you switch to? Problem with BackBlaze and others is that they delete backups if not connected for 30 days, particularly external drives.
I moved to Acronis True Image when they offered unlimited cloud backups with their 2016 version. They probably couldn't sustain it, because I had to pay a lot more for backups when I wanted to renew in 2017.

Now, I'm using both Arq and Synology's Hyper Backup with Amazon Cloud Drive. One of the problems I foresee is that while Amazon doesn't care how much data one stores in Cloud Drive, they have suspended users for downloading past an arbitrary limit in a certain period of time — so full restores might not be possible.

Do you work at Gitlab ? :)
too soon
".. do you actually TEST those backups?"

yes (of course), see my post above.

I have used Time Machine repeatedly to restore lost or damaged files. I also replaced hard drives several times and played back my carbon copy clone. It boots and I have never missed a file in years.

This is one thing I like about doing content addressed storage. I've been toying with my own implementation, quite similar to Camlistore.

The net result is it's super simple to verify an entire datastore as being valid or not.
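To show why verification is so simple with content addressing, here is a toy Python sketch: every blob is stored under the SHA-256 of its own bytes, so checking the whole store is just re-hashing each file and comparing against its name. The on-disk layout and the `put`/`verify` names are assumptions for illustration, not Camlistore's actual format or the implementation mentioned above.

```python
import hashlib
from pathlib import Path


def put(store: Path, data: bytes) -> str:
    """Store data under its SHA-256 hex digest and return that digest."""
    digest = hashlib.sha256(data).hexdigest()
    blob = store / digest[:2] / digest  # two-character fan-out directory
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():  # identical content is stored only once
        blob.write_bytes(data)
    return digest


def verify(store: Path) -> list:
    """Return the names of blobs whose content no longer hashes to their name."""
    corrupt = []
    for blob in sorted(store.glob("*/*")):
        if not blob.is_file():
            continue
        # Reads each blob fully into memory, which is fine for a sketch.
        if hashlib.sha256(blob.read_bytes()).hexdigest() != blob.name:
            corrupt.append(blob.name)
    return corrupt
```

No separate manifest is needed because the file name is the checksum. What this cannot detect is a blob that has gone missing entirely; that needs an index of expected digests kept elsewhere.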

I have restored files from BB multiple times. It is a great solution for non technical ppl or offices that have at least 24Mbps connections up/down.
what happened to the backblaze backup?
Interesting question. I wonder if it was prior to when BB moved to the "direct wire" architecture in Storage Pod 4.0.