by jacquesm 1050 days ago
> I think it's very hard to run a service by yourself of this magnitude reliably, but I'd always take a 99.9% availability daily backup service that runs right at SLO over one that's down for a day once in a blue moon.

That's a fallacy right there. Your assumption should be that any service you rely on will be down once in a blue moon, and possibly for a day or even longer.
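To put rough numbers on that comparison, here is a small sketch of the arithmetic. The function names are made up for illustration, and it assumes availability is measured as a simple fraction of wall-clock hours over a 365-day year, which may not match how any particular provider defines its SLO.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period at the given availability."""
    return (1.0 - availability) * period_hours


def availability_with_outage(outage_hours, period_hours=HOURS_PER_YEAR):
    """Availability over the period if the only downtime is one outage."""
    return 1.0 - outage_hours / period_hours


def hours_between_outages(outage_hours, availability):
    """How rare an outage of this length must be to still meet the target."""
    return outage_hours / (1.0 - availability)


nines = allowed_downtime_hours(0.999)          # about 8.76 hours per year
one_day = availability_with_outage(24)         # about 0.9973 over one year
spacing = hours_between_outages(24, 0.999)     # about 24000 hours, ~1000 days
```

Under those assumptions, a service that loses one full day roughly every 1000 days still sits at about 99.9%, so the two services in the quote are not necessarily different ones.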

> Also, parent is talking about ingestion. If your backups aren't configured well and the backup process fails, then your backup may not end up durable.

Yes, indeed, you need to do your work and you don't get to point at others for not doing it right.

> I also don't think your definition of reliable is generally recognized, which I'd generally call durability.

Reliability, durability and availability are all industry terms and have very clear definitions. These are not the same definitions that you would use in ordinary conversation with laypeople but when we're talking shop those are definitely allowed.

> I wouldn't say the scenario above is a durability failure, but an example of the consequences of poor availability.

No, it is a consequence of poor engineering on the part of the user of the service, and is a completely different issue. You engineer your service to ensure that your assumptions hold true, and if you fail at doing that, your service will fail. When it fails is then only a matter of time and a combination of circumstances, but fail it will.

2 comments

> No, it is a consequence of poor engineering on the part of the user of the service,

The entire service going down for 24 hours due to a reboot is not a consequence of poor engineering on the part of the user. A production service that people rely on for critical data failing on the _textbook_ case of running a live service is poor engineering on the service's part.

I've seen entire datacenters and many services go offline due to 'minor mishaps' and that was stuff done by the largest companies on the planet. If you don't account for failure of underlying infra + services you are not doing it right.

Tarsnap makes very particular guarantees. If you look into them, you'll realize that for some applications it is very useful and for others it is not, or that you may have to use not one but multiple backup services to cover all your needs. This can be costly.

Tarsnap doesn't help you use the service well if you need to implement your own retries while the docs tell you to just write a single-line shell script and call it from crontab [0].

[0] https://www.tarsnap.com/simple-usage.html
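For what "implement your own retries" might look like, here is a minimal sketch of a wrapper that a cron job could call instead of the bare one-liner. It is not from the Tarsnap docs: `run_with_retries`, the attempt count, and the delays are all assumptions for illustration, and the archive name and path in the comment are placeholders.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, base_delay=60.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts. Returns True on success, False if every attempt failed.
    The run and sleep parameters exist only so the function can be tested
    without executing anything or actually waiting.
    """
    for attempt in range(attempts):
        result = run(cmd)
        if result.returncode == 0:
            return True
        # Back off before the next try, but not after the final failure.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Hypothetical use from a cron-invoked script (archive name and path are
# placeholders, not recommendations):
#
#   ok = run_with_retries(
#       ["tarsnap", "-c", "-f", "myhost-2024-01-01", "/home"])
#   if not ok:
#       raise SystemExit("backup failed after all retries")
```

Exiting non-zero when every attempt fails matters as much as the retry loop itself: cron will then report the failure rather than letting a missed backup pass silently.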