Hacker News
by kgo 5811 days ago
Eventually your hardware will fail. If you can't even safely reboot your machines under controlled circumstances, you've already lost the uptime battle.
3 comments

And that's one point for specialized hardware. On a zSeries mainframe, CPU errors are detected on-the-fly, faulty CPUs are deactivated and their processes migrated to functioning CPUs.

They cost a lot, but they deliver a lot of confidence too.

I have run into problems where the machine would not survive a reboot after updates, but I did not find this out until months later.
It's OK to reboot from time to time. What is not OK is to have a reboot imposed on you when you would rather continue running. It's not a huge disruption to reboot a cluster node, as long as the rest of the cluster takes the load.
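
To make the cluster case concrete, here is a rough sketch of a rolling reboot that takes one node at a time and refuses to proceed if too few of the other nodes are up to carry the load. Everything in it is hypothetical: `rolling_reboot`, the `reboot` and `is_healthy` callbacks, and the `min_healthy` threshold are illustrative names, not part of any real cluster manager.

```python
from typing import Callable, Iterable, List


def rolling_reboot(
    nodes: Iterable[str],
    reboot: Callable[[str], None],      # hypothetical: reboots a node and waits for it
    is_healthy: Callable[[str], bool],  # hypothetical: True if the node is serving
    min_healthy: int,                   # assumed minimum number of *other* nodes that must be up
) -> List[str]:
    """Reboot nodes one at a time, halting if the cluster cannot absorb the load."""
    nodes = list(nodes)
    rebooted: List[str] = []
    for node in nodes:
        # Count the other nodes that could take over while this one is down.
        others_up = sum(1 for n in nodes if n != node and is_healthy(n))
        if others_up < min_healthy:
            raise RuntimeError(
                f"only {others_up} other nodes healthy; not rebooting {node}"
            )
        reboot(node)
        # Stop the rollout at the first node that does not come back.
        if not is_healthy(node):
            raise RuntimeError(f"{node} did not come back; halting rollout")
        rebooted.append(node)
    return rebooted
```

The point of the design is that the reboot stays under your control: the rollout halts before taking a node down if the rest of the cluster could not carry the load, and again as soon as one node fails to return.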

Rebooting makes sure the filesystem is properly scrubbed, temporary files are removed, and any stale data in memory is cleared.

Uptime competitions are pointless.

But forced downtime (Windows Update-style) is unacceptable.

> Uptime competitions are pointless.

While uptime competitions don't indicate availability very well, they do show how much time has passed since the last kernel crash or the last kernel security hole that required a kernel upgrade and thus a reboot.

(Unless, of course, someone is trading security for uptime, which is luckily the exception rather than the norm, at least among responsible admins.)

It appears that neither Windows nor Linux works particularly well here, but the BSD systems are quite impressive in that regard, especially OpenBSD.

The good thing about Ksplice is that not all kernel updates will require a restart. I see the Ubuntu folks got very bold in pushing new kernels down the update pipe in the last couple of releases.

Availability is also somewhat overrated. Like I said, it's not the downtime that kills you, but the forced, unpredicted downtime.

> the BSD systems are quite impressive in that regard, especially OpenBSD

From my experience with OpenBSD, this appears to be quite simply because things have been pruned so well that there's nothing to fail. What is there is well-designed, and IIRC my general-purpose install was ~200MB.

The insanely lightweight and simple nature of OpenBSD is one of my favorite things about it, and probably one of the biggest contributing factors to its strengths.

Eventually parachutes need to be used. Now, they should work and save you, but why risk it if you don't have to?
Because if you don't exercise that functionality, you will never have confidence that it will work when you need it to.
Worse, you might have false confidence that it will work, only to find out at the worst possible moment that it won't.