by trcollinson 3404 days ago
Excellent question. I do test my backups and restores on a rather constant basis. Each environment within my infrastructure takes a bit of a different approach.

Application

This is by far the easiest for me to test. We have a CI/CD job which literally makes a new environment, from scratch, and deploys our application to it in a production configuration. It runs a test suite which tests functionality across the application. Finally, it destroys the environment. It reports on each portion of the process. In this way we know exactly how long it would take to redeploy the entire application from scratch on new infrastructure and get it up and running. This morning it took about 6 minutes total before tests ran.
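
To make the shape of that job concrete, here is a minimal sketch in Python. It is not our actual pipeline: `run_ephemeral_pipeline` and the stage callables are hypothetical names, and the real stages would call your own create, deploy, and test scripts. The point it illustrates is that every stage is timed and reported, and the environment is destroyed even when an earlier stage fails.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_ephemeral_pipeline(
    stages: List[Tuple[str, Callable[[], None]]],
    teardown: Callable[[], None],
) -> Dict[str, object]:
    """Run stages in order, time each one, and always tear the environment down.

    `stages` and `teardown` are hypothetical hooks standing in for real scripts.
    """
    report: Dict[str, object] = {"ok": True, "stages": []}
    try:
        for name, step in stages:
            started = time.monotonic()
            try:
                step()
                status = "passed"
            except Exception as exc:  # report the failure instead of crashing
                status = f"failed: {exc}"
                report["ok"] = False
            report["stages"].append(
                {"name": name, "status": status, "seconds": time.monotonic() - started}
            )
            if not report["ok"]:
                break  # skip the remaining stages, but still destroy below
    finally:
        started = time.monotonic()
        teardown()
        report["stages"].append(
            {"name": "destroy", "status": "passed", "seconds": time.monotonic() - started}
        )
    return report
```

Summing the `seconds` values of the stages before the test run gives the "how long would a full redeploy take" number.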

Database

We are running an RDBMS. We use a combination of daily full backups, incremental transaction-log-style backups, and point-in-time backups. Again, in our CI/CD, when a full backup is taken it is pulled, loaded, and a test routine is run against it to check integrity. At this time, the restore from the day before is destroyed. When a transaction log backup is made, CI/CD picks up this change, applies it to the full backup restore, and runs a set of integrity tests. This leaves us with a warm standby ready to be switched over to in case the main database server goes down. We have never had to use the warm standby in an emergency, but we have a test to make sure we can cut over to it as well.
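
As a rough illustration of that flow (a sketch only, not our actual tooling), the bookkeeping can be modeled like this in Python. `StandbyVerifier` and its three callables are hypothetical; in practice they would wrap whatever restore and integrity-check commands your database provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StandbyVerifier:
    """Keeps a restored copy current and re-checks it after every backup event."""

    restore_full: Callable[[str], None]   # hypothetical: load a full backup
    apply_log: Callable[[str], None]      # hypothetical: replay one log backup
    check_integrity: Callable[[], bool]   # hypothetical: run the test routine
    applied: List[str] = field(default_factory=list)

    def on_full_backup(self, backup_id: str) -> None:
        # A new full backup replaces the restore built from the previous one.
        self.restore_full(backup_id)
        self.applied = [backup_id]
        self._verify(backup_id)

    def on_log_backup(self, log_id: str) -> None:
        if not self.applied:
            raise RuntimeError("no full backup restored yet")
        self.apply_log(log_id)
        self.applied.append(log_id)
        self._verify(log_id)

    def _verify(self, artifact: str) -> None:
        if not self.check_integrity():
            raise RuntimeError(f"integrity check failed after {artifact}")
```

Raising on a failed check is a deliberate choice here: a backup that cannot be restored and verified should fail the CI/CD job loudly rather than be logged and forgotten.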

Point-in-time backup testing goes back to our application test above. The application test will spin up with a point-in-time recovery of the database backup. It will test the integrity of that recovery and then test the application against it. Finally, it will swap from the point-in-time recovered database to the warm standby. It runs the test suite against that for integrity as well.

File Store

People often forget this, but those buckets that hold all of your file storage in the cloud can be destroyed so easily (sad, sad experience taught me this). We test those as well. I am sure you can guess at this point how we do that? CI/CD. It's a rather simple process with a ton of gain.
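
One way such a check could look, as a sketch: build a checksum manifest of the original files and of the restored copy, then diff the two. The code below assumes both sides are reachable as local directories, which is a simplification for illustration; `build_manifest` and `diff_manifests` are hypothetical helper names, not part of any cloud SDK.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large objects do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(expected: Dict[str, str], restored: Dict[str, str]) -> List[str]:
    """List every expected file that is missing or different in the restore."""
    problems = []
    for key, checksum in expected.items():
        if key not in restored:
            problems.append(f"missing: {key}")
        elif restored[key] != checksum:
            problems.append(f"corrupt: {key}")
    return problems
```

An empty list from `diff_manifests` means every file in the original made it through the backup and restore unchanged.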

A few notes

People always ask me this, so I will answer it first. Yes, this costs money. It's not as bad as running a second production environment. But it will cost you a bit. My follow-up question is, how much does downtime cost you?

My CI/CD is always GitLab CI at this point. I've used Jenkins. I've used Travis. I like GitLab CI. You can do all of this with any of those.

We script literally everything. Computers are so good at repetitive tasks. Why would you EVER do anything manually? Really. If it has to do with your infrastructure, script it.

If anyone has any questions about these ideas, feel free to reach out.

1 comments

How many (full time) devs and how long did that set up take?

We currently have 4 full time devs, a QA, a DBA consultant, and a Designer on the team.

Honestly, none of that took very long to set up at all. The application in this case is a Ruby on Rails backend, PostgreSQL database, Angular front end, with file storage and a few other smaller services.

Step one: We have a lot of tests and we believe in a good test suite. Are we perfect? Absolutely not. But it is important to be able to "know" the application works. Define what helps us to know it works, and automate tests to do that. Things like "Can you log in?", "Can you select a record of type X, Y, Z, A, B, and C and do those records have the data you would expect in the right places?" You can have a human do this, or you can automate it. Automate it.
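
A sketch of what automating the "do those records have the data you would expect" question might look like, in Python for illustration. This is not our test suite: `check_records`, the `fetch_record` callable, and the field names are all hypothetical stand-ins for whatever your application exposes.

```python
from typing import Callable, List, Mapping, Optional


def check_records(
    fetch_record: Callable[[str], Optional[Mapping[str, object]]],
    expected_fields: Mapping[str, List[str]],
) -> List[str]:
    """Fetch one record per type and report fields that are missing or empty."""
    failures = []
    for record_type, fields in expected_fields.items():
        record = fetch_record(record_type)
        if record is None:
            failures.append(f"{record_type}: no record returned")
            continue
        for name in fields:
            if record.get(name) in (None, ""):
                failures.append(f"{record_type}: field '{name}' is empty")
    return failures
```

The expectations live in a plain mapping, so adding a record type means adding one entry rather than writing another test by hand.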

Step two: Automate your deployment. The Rails application is bundled into a Docker container. We use ECS (Elastic Container Service) to maintain our environments. CI/CD first runs tests, second, builds the latest Docker container, third, places the Docker container into a repo, fourth, deploys the container to the correct ECS environment, fifth, profit! This is all automated and works the same every time with checks and balances along the way. Our Angular application runs out of S3 buckets with CloudFront caching. This was a matter of using webpack to compile the Angular application down to production-deployable artifacts and then a simple bash script to move those artifacts to the S3 bucket. The database is an RDS instance so we get some fun things built in there. Note: All of the AWS setup is also automated with scripts. Create VPC, create autoscaling group, create targets, create RDS instances, create S3 buckets, create CloudFront, and delete all of the above (and more, AWS is complex), are all just scripts.
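
For the create and delete scripts, one pattern that keeps teardown reliable is to record each resource as it is created and delete in the reverse order. The sketch below is in Python for illustration; `EnvironmentStack` is a hypothetical helper, the reverse-order teardown is an assumption rather than a description of our scripts, and the create and delete callables stand in for the real AWS scripts.

```python
from typing import Callable, List, Tuple


class EnvironmentStack:
    """Tracks created resources so they can be deleted in reverse order."""

    def __init__(self) -> None:
        self._created: List[Tuple[str, Callable[[], None]]] = []

    def create(
        self,
        name: str,
        create_fn: Callable[[], None],
        delete_fn: Callable[[], None],
    ) -> None:
        # Register the delete step only after creation succeeds.
        create_fn()
        self._created.append((name, delete_fn))

    def destroy(self) -> List[str]:
        """Delete everything created so far, newest first; return the order used."""
        destroyed = []
        while self._created:
            name, delete_fn = self._created.pop()
            delete_fn()
            destroyed.append(name)
        return destroyed
```

Deleting newest first means dependents go before the things they depend on, for example an instance before the network it lives in.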

Step three: Because we have a test suite and deployment scripts, the rest of the process is easy. Just use the scripts to create whatever environment we need, stick it on a schedule, record the results in CI/CD, and alert the WHOLE FREAKING WORLD if something doesn't work.

Now you might say, easy to say in a Rails environment, with so few moving parts, with such a new project, etc etc etc (I have heard every excuse in the book). I have done this for many other companies. The last one I did it at had about 50 engineers, ran a large mixed bag of Java applications on Tomcat servers, ran Oracle for their data, and had no tests and a ton of legacy code. We got to the same point I have already described by simply breaking it into chunks. First, automate the tests that were done manually. Second, automate the deployment steps that were done manually. Third, automate the environment things that were done manually. Finally, schedule everything and monitor.

I learned to do this at HP Labs, where we used the same process with a very large API fronting a C-based image processing system with petabytes of storage, thousands of servers, and a huge number of moving parts. I promise, it can work anywhere.