Hacker News
by Kwpolska 1050 days ago
Reliability is important for a backup service. If your machine explodes and you need to restore from backups, but the backup service is down, you need to wait and may lose money due to the outage (SLA, unhappy customers, no ability to onboard new customers etc.). If you’re doing weekly backups, but the backup service was down during the backup slot, and your crontab setup doesn’t yell at you and doesn’t retry until it succeeds, you might lose two weeks’ worth of data if disaster strikes.

Yes, reliability is important. And by that measure Tarsnap is 100% reliable. But not 100% available, and that's something that often gets confused. Having to wait while you are trying to restore a backup would be extremely annoying, but that implies that you've done something wrong in your planning: if you expect your backup service to be 100% available then you are probably not engineering things right, because for many reasons that might not be the case. Tarsnap does not promise 100% availability, and no other backup service that I'm aware of does. For instance, Backblaze offers 11 (!) nines reliability but only 3 nines availability (which is pretty much expected).

If you want more than 3 nines availability neither Backblaze nor Tarsnap nor any other outside service would be able to serve your needs.
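To put the "nines" in perspective, here is a minimal sketch of the arithmetic, assuming a 365-day year and reading "N nines" as an availability of 1 - 10^-N. The function name is made up for illustration. Three nines works out to roughly 8.76 hours of allowed downtime per year.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return unavailability * HOURS_PER_YEAR


# Print the downtime budget for a few availability levels.
for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_hours_per_year(n):.4f} hours/year")
```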

I think it's very hard to run a service by yourself of this magnitude reliably, but I'd always take a 99.9% availability daily backup service that runs right at SLO over one that's down for a day once in a blue moon.

Also, parent is talking about ingestion. If your backups aren't configured well and the backup process fails, then your backup may not end up durable.

I also don't think your definition of reliable is generally recognized, which I'd generally call durability. I wouldn't say the scenario above is a durability failure, but an example of the consequences of poor availability.

> I think it's very hard to run a service by yourself of this magnitude reliably, but I'd always take a 99.9% availability daily backup service that runs right at SLO over one that's down for a day once in a blue moon.

That's a fallacy right there. Your assumption should be that any service you rely on will be down once in a blue moon, and possibly for a day or even longer.

> Also, parent is talking about ingestion. If your backups aren't configured well and the backup process fails, then your backup may not end up durable.

Yes, indeed, you need to do your work and you don't get to point at others for not doing it right.

> I also don't think your definition of reliable is generally recognized, which I'd generally call durability.

Reliability, durability and availability are all industry terms and have very clear definitions. These are not the same definitions that you would use in ordinary conversation with laypeople but when we're talking shop those are definitely allowed.

> I wouldn't say the scenario above is a durability failure, but an example of the consequences of poor availability.

No, it is a consequence of poor engineering on the part of the user of the service, and is a completely different issue. You engineer your service to ensure that your assumptions hold true, and if you fail at doing that your service will fail. When is then only a matter of time and a combination of circumstances, but fail it will.

> No, it is a consequence of poor engineering on the part of the user of the service,

The entire service going down for 24 hours due to a reboot is not a consequence of poor engineering on the part of the user. A production service which people rely on for critical data failing on the _textbook_ example of running a live service is poor engineering on the service's part.

I've seen entire datacenters and many services go offline due to 'minor mishaps' and that was stuff done by the largest companies on the planet. If you don't account for failure of underlying infra + services you are not doing it right.

Tarsnap makes very particular guarantees. If you look into them, you'll realize that for some applications it is very useful and for others it is not, or that you may have to use not one but multiple backup services to be able to serve all your needs. This can be costly.
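As a rough illustration of why multiple services help, here is a sketch of the combined availability of several backup services, under the strong assumption that their outages are independent (in practice they may share infrastructure, so treat this as an upper bound). The function name is hypothetical.

```python
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Probability that at least one service is up, assuming independent outages."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that this service is down too
    return 1.0 - all_down


# Two independent services at three nines each.
print(combined_availability([0.999, 0.999]))
```

Under that independence assumption, two three-nines services together are unavailable only when both are down at once, which is about one in a million.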

Tarsnap doesn’t help you use the service well if you need to implement your own retries and the docs tell you to just write a single-line shell script and call it from crontab [0].

[0] https://www.tarsnap.com/simple-usage.html
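For what "implement your own retries" might look like, here is a minimal sketch of a wrapper a cron job could call instead of the backup command directly. It is not taken from the Tarsnap docs; the helper names, attempt count, and delays are all assumptions, and the backup command itself is passed in rather than hard-coded.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> bool:
    """Run the command once; True means it exited with status 0."""
    return subprocess.run(list(cmd)).returncode == 0


def run_with_retries(
    attempt: Callable[[], bool],
    max_attempts: int = 5,          # assumed default, tune to your backup window
    base_delay: float = 60.0,       # seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt() until it succeeds, doubling the delay after each failure."""
    for i in range(max_attempts):
        if attempt():
            return True
        if i < max_attempts - 1:    # no point sleeping after the final failure
            sleep(base_delay * 2 ** i)
    return False


def main(argv: Sequence[str]) -> int:
    """Hypothetical entry point: argv[1:] is the backup command to run."""
    cmd = list(argv[1:])
    if run_with_retries(lambda: run_command(cmd)):
        return 0
    # A non-zero exit plus stderr output lets cron's mail (or a monitor) yell at you.
    print(f"backup failed after retries: {cmd}", file=sys.stderr)
    return 1
```

A crontab entry would then invoke a small script that calls `sys.exit(main(sys.argv))` with the real backup command as its arguments. The retries only cover short outages; an outage longer than the total retry window still needs alerting.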

That's a funny definition of "reliable". I'd factor availability into reliability. If I Uber to work and every time an Uber picks me up it gets me to my destination with 100% success but once a week no Ubers are available, is that a reliable mode of transportation? Would my boss not shout at me to find a more reliable way to get to work?

Eric Brewer's calling, and would like a word.

Availability and correctness are fundamentally opposed. The word "reliable" is contextual.

A backup service that is always available but serves up garbage is not as reliable as one that serves me the correct data, but only on Mondays.

Sure, but if you took Uber every day and after several years none was available for just one day, your boss would forgive you and still consider Uber reliable. If it suddenly had a lot of failures you would be told to find another way, but everyone has a few days per year when they can't get to work (often because they're sick).