by empthought 1326 days ago
This is a weird take; most systems in data centers don’t have people walking from rack to rack yanking power cords, and most consumer systems don’t even have a power cord to yank.
4 comments

While I agree it's a bit of a weird take, for example -- there may be performance tradeoffs made in any given workload to make the disk consistent, inconsistently

The 'most' there is doing a lot of work

It is actually quite a common practice for those being audited for disaster recovery to do exactly that -- yank cables. More realistically, flip some switches

We do it once a year, set aside a region and time... then test our processes

It serves a few purposes, most importantly -- are our services fault tolerant, and can we bring them back?

I think it's reasonable to trap the signals and make a best effort, knowing that PID 1 (or the environment) will eventually have to SIGKILL you -- ready or not

Just because we can't save all of the state doesn't mean we shouldn't try

Right, there are failure modes that have to be tested and accounted for, and one of them is the state being inconsistent after a shutdown.

The previous poster seemed to advocate for not thinking of this as a failure mode at all but rather normal operation, which I just don’t see as true.

This paper was influential with regard to this idea: https://www.usenix.org/conference/hotos-ix/crash-only-softwa...

I don't think it's that unusual, but obviously there are tradeoffs.

Totally, it's certifiably untrue!

Take the InnoDB storage engine in MySQL/MariaDB for example.

For performance (and likely other) reasons, its tablespace file only grows. It never shrinks... it will only go to 0 or grow.

The DB (or individual tables, depending on config) has to be truncated/emptied to reclaim those blocks.

Stop it uncleanly and there's a good chance you'll have to sacrifice a considerable amount of the data just to get the engine to start

This and countless other things have to make consistency trade-offs. While everything could be written to only operate atomically, it would then slow to a crawl.

> and most consumer systems don’t even have a power cord to yank.

Some do. And the rest occasionally forcibly reboot (kernel panic or hardware failure), need to be manually forcibly rebooted (due to frozen UI), or unexpectedly lose battery, all leading to the same outcome. At least, that’s been my experience with just about every computer, phone, tablet, smartwatch, game console, and smart TV I’ve ever owned. Plus a number of routers. Is your experience different?

Weird take?

It is a table stakes expectation for most servers that they will not lose data when the power goes out, or when the kernel panics, or when the server itself crashes or runs out of memory. If your software requires graceful shutdown, that seems to imply that it will lose data in all those cases.

You can perhaps use graceful shutdown to perform some optimization that allows subsequent startup to go faster, e.g. put things in a clean state that avoids the need for a recovery pass on the next startup... but these days with good journaling techniques "recovery" is generally very fast. When that's the case, it's arguably better to always perform non-graceful shutdown to make sure you are actually testing your recovery code, otherwise it might turn out not to work when you need it.

So yeah, I agree with SoftTalker. Assume all shutdowns will be sudden and unexpected, and design your code to cope with that.
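
A rough sketch of what "always run recovery" can look like, assuming a toy append-only journal of JSON lines (the file format and the `append_record` / `recover` names are invented for this example, not taken from any real database). Startup always replays the journal and discards a torn final record, so the crash path and the normal path are the same code.

```python
import json
import os


def append_record(path, record):
    """Append one JSON record per line and force it to stable storage."""
    line = (json.dumps(record) + "\n").encode("utf-8")
    with open(path, "ab") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())


def recover(path):
    """Rebuild state by replaying the journal; runs on every startup.

    A partial last line (torn write from a crash) is dropped and the
    file is truncated back to the last complete record.
    """
    state = {}
    if not os.path.exists(path):
        return state
    with open(path, "rb") as f:
        data = f.read()
    good = 0  # byte offset just past the last complete, valid record
    for raw in data.splitlines(keepends=True):
        if not raw.endswith(b"\n"):
            break  # incomplete final record
        try:
            rec = json.loads(raw)
        except ValueError:
            break  # corrupt record: stop replaying here
        state[rec["key"]] = rec["value"]
        good += len(raw)
    if good < len(data):
        with open(path, "r+b") as f:
            f.truncate(good)
    return state
```

Because there is no separate clean-shutdown format, every restart exercises `recover`, which is the point being made about testing your recovery code.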

That software should not “even expect” a graceful shutdown is the weird take.

Servers can and do “lose data” all the time when they’re shut down unexpectedly. I don’t know why you’d think they don’t. If the data has been read from somewhere (a socket maybe) and not fsynced, it’ll be lost. I agree that the system needs to be designed in such a way that this is a recoverable state, but I disagree with the ideas that applications should not have a mode where buffers can accumulate for some period of time without being fsynced, and that there should be no attention paid to the common case, which is planned process stops (aka SIGTERMs) for a variety of reasons. System shutdown just being one of those reasons.

By "lose data" I mean losing a confirmed write. That is, the server got a request to modify some state, and it responded to the request indicating success, but the change is later lost. Generally it's expected that databases will not lose confirmed writes, unless the application has explicitly made the decision that this is acceptable and opted into possible data loss to improve performance.

I think in that situation they expect you to not ack the data before you called fsync (which is the same expectation most people have of their SQL database). Then the remote end can retry the operation.
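
As an illustration of "don't ack before fsync, let the client retry", here is a small Python sketch. `DurableLog`, the `"ACK"` string, and `send_with_retry` are all hypothetical names for this example; a real server would have its own protocol.

```python
import os


class DurableLog:
    """Appends payloads to a file and acks only after fsync returns."""

    def __init__(self, path):
        self._f = open(path, "ab")

    def write(self, payload: bytes) -> str:
        self._f.write(payload + b"\n")
        self._f.flush()              # Python buffer -> OS page cache
        os.fsync(self._f.fileno())   # OS page cache -> storage device
        return "ACK"                 # only now is the write "confirmed"

    def close(self):
        self._f.close()


def send_with_retry(send, payload, attempts=3):
    """Client side: keep retrying until the write is acknowledged."""
    for _ in range(attempts):
        try:
            if send(payload) == "ACK":
                return True
        except OSError:
            pass  # no ack received: the write may or may not have landed
    return False
```

Since a retry can follow a write that actually reached disk but whose ack was lost, the operation should be idempotent (or carry a request ID the server can deduplicate on).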

They do. Reminds me of a previous job long ago where a datacenter tech was checking where a network cable went by tugging on it, resulting in a network switch blade being yanked out of a chassis, bringing down half of the production environment.

These things did happen, can happen, and will happen.

Even in modern cloud environments. AWS might consider the hardware your EC2 VM is running on unstable, prompting you to replace/move the VM within 24 hours (if it has not already been brought down by hardware failure).