by dale_glass 1118 days ago
A machine staying up for almost 3 years is irresponsible in this day and age.

Yeah, I remember people having uptime competitions on Slashdot and the like some decades back, but you only need to look at the ssh logs of a 5-minute-old machine to realize this is a terrible idea in modern times.

I don't understand opinions like this.

Just because it would be dangerous for your nodejs web_app.exe running on Ubuntu behind Apache, fully exposed on the internet, doesn't mean it is dangerous everywhere.

There are a billion other ways to use computers, even air-gapped systems.

So don't try to justify an obvious flaw.

I mean, hardware is cheap enough that any server of importance should be individually disposable.

Yeah, you can do stuff to maximize uptime but if it needs to stay up that badly you have to consider the case of the hardware needing to be turned off at some point.

> So, dont try to justify obvious flaw

I'm not, it's a bug and should be fixed. But I think if anything is powered for 3 years straight it's a bit concerning.

Otherwise you're liable to find that somebody started something by hand 2 years ago, and at a critical moment nobody quite remembers what the command was.

You live in your own World with other people. Please just keep in mind there are many other Worlds with other people and laws of the Universe.

I don't know if you're young or don't know much about history but what you describe is a fairly recent way of looking at things, it's not the only one and I guarantee you it will become "out of fashion".

Yeah, the "cattle not pets" philosophy is fairly recent, but I don't see it changing any time soon. If anything we're going even more in that direction.

And it makes a lot of sense because if uptime is that important, then no matter how fancy the hardware it can't do anything about disasters or losing internet connectivity.

We might go so far in that direction we wind up right back on the other side. It always happens; it’s more of a pendulum swinging back and forth than the kind of straightforward progress you are imagining.
The 7002 seems like it could be used in a workstation, where the “cattle vs pets” thing is less of a distinction, right? (I guess a workstation is sort of like a work dog in this analogy).
That's the EPYC lineup, which is the server model. Support for terabytes of RAM, 128 PCIe lanes, that sort of thing.

I mean you could use it in a workstation, but unless you need 4 video cards locally it's probably overkill for most uses.

And a workstation should have no problem rebooting once in a while.

As an additional data point -

I have ~1000 7002 cores in my home DC (8 dual socket R7525s with 48-64 cores each) that run kubernetes but are connected to a battery backup and use kexec to perform upgrades. So, while I am very bought into the cattle not pets philosophy, it's rare that any of these machines need to be turned off and I could see them being on for three years continuously without problem otherwise.

> But I think if anything is powered for 3 years straight it's a bit concerning.

Pretty much why Pawsey has an Annual High Voltage inspection shutdown [1]

> Otherwise you're liable to find things like [..]

TBH that's not really been an issue of note at any of the big iron farms I've been around since the 1980s .. generally there's a disciplined approach to maintaining 24/7/365 operation (that includes scheduled downtime for equipment checks) part of which is process documentation and justification and soft means of freezing | migrating processes+data etc.

[1] https://status.pawsey.org.au/incidents/tk5n5y965r5j

Individually disposable, yes. But if you have a cluster of those, and you powered them on at the same time -- as it often happens -- you're in for an exciting ride when your servers start rebooting almost simultaneously, give or take a few minutes.
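
One way to avoid that ride is to reboot on a schedule you choose instead of the one the bug chooses. A minimal sketch, assuming the ~1042-day figure quoted elsewhere in this thread is the limit to stay under; the function name, the 30-day margin and the one-hour gap are all made up for illustration:

```python
from datetime import datetime, timedelta

# Assumption: ~1042 days of uptime is the limit to stay under.
UPTIME_LIMIT = timedelta(days=1042)


def plan_staggered_reboots(boot_times, now,
                           margin=timedelta(days=30),
                           gap=timedelta(hours=1)):
    """Return {host: reboot_time} for hosts within `margin` of the limit.

    Hosts are rebooted one at a time, `gap` apart, oldest boot first,
    so machines that were powered on together do not go down together.
    """
    due = sorted(
        (host for host, booted in boot_times.items()
         if booted + UPTIME_LIMIT - now <= margin),
        key=lambda host: boot_times[host],
    )
    return {host: now + i * gap for i, host in enumerate(due)}
```

Feed it the boot timestamps from your inventory; hosts that are nowhere near the limit are simply left out of the plan.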
3 years is irresponsible? To quote Logan Roy, you software developers "are not serious people" [1]. Just out of curiosity, I looked for a list of the longest-running electrical devices [2]:

    1840 - The Oxford Electric Bell
    1871 - Souter Lighthouse in South Shields, UK
    1896 - The Isle of Man’s Manx Electric Railway
    1902 - The Centennial Bulb
Apparently, "The Centennial Bulb has seen just two interruptions: for a week in 1937 when the Firehouse was refurbished, and in May 2013 when it was off for nine and a half hours due to a failed power supply."

[1] https://www.youtube.com/watch?v=LZTaXjt2Ggk

[2] https://www.drax.com/electrification/4-of-the-longest-runnin...

Yes, 3 years without a hardware reset of a component not designed for long-term high-reliability use is irresponsible (there are servers for very high reliability, they are just WAY more expensive).

BUT this doesn't mean you need to have downtime, in the same way a train unit in a railway system going through maintenance doesn't mean your railway system has downtime.

Redundancy is a must-have feature for reliable systems, and that means your system must be able to cope with random hardware failure or the reboot of a server unit.

And both planned and unplanned maintenance of components are important, normal business, which in a well-designed reliable system should not lead to downtime.

Similarly, testing failure cases is important and should be done.

So either you don't run a high-reliability system (and likely never run into this bug), or you run a proper reliable system (and it's not a big deal), or you run a badly designed or badly operated system that pretends to be highly reliable but isn't really... which is irresponsible (if you are aware of it).

Those are completely trivial complexity-wise compared to a modern server, and many don't have a real function, and mostly are artificially maintained as a curiosity.

I mean, the centennial bulb barely glows, that's why it still works. The hotter the filament gets the faster it evaporates, so a light bulb that barely makes any light can stay working forever.

Sure, I was looking for electrical devices. A better example of what great engineering can achieve is, I suppose, the Pons Fabricius [1], a bridge built 2,085 years ago and still in use.

The problem is, if we can't expect software to run essentially forever, to update without 'restarts', and so forth, how are we ever going to achieve neural chip implants, artificial organs, synthetic agents mining ore in outer space, and so on? Software is not a gear mechanism or a rack and pinion; there is absolutely no reason to restart an 'operating system' or to ever lose state. Yet we have become accustomed to it, and we commit these sorts of crimes daily: restarts and refreshes.

[1] https://en.wikipedia.org/wiki/Pons_Fabricius

Don't get me wrong, I'm not saying it's not a problem. It should be fixed.

But if you need a single system to stay up for 3 years straight that's probably not good. There's too much going on in a modern high tech server for that to be a good idea. Everything has a CPU in it (including disks, video cards, network cards, etc). And any of that could make your system unusable by hitting some rare condition.

> The problem is, if we can't expect software to run essentially forever, to update without 'restarts', and so forth, how are we ever going to achieve neural chip implants, artificial organs, synthetic agents mining ore in outer space, and so on?

I would hope such things to be purpose-made and to be made in a way that the user can survive a reboot/firmware update. Eg, your neural implant should be built in such a way that it's not going to be life threatening if the battery runs out. The system has to be designed with that accounted for.

Maybe there's a secondary, minimal implementation acting as a backup and keeping critical functions working while the fully featured one is being updated. Hopefully everything is implemented in a failsafe way so that if it completely stops working you're not in a worse state than before you got it.

Any plan where there's a crucial component that must not stop even for a second isn't a very good plan.

"Any plan where there's a crucial component that must not stop even for a second isn't a very good plan."

Our bodies, just think of our hearts or lungs, don't stop for even a second for 80-something years, and even that 80 is most probably arbitrary, given very few changes in cellular control (instead of cancer, cooperate; instead of scar, regenerate [1]). No current software artifact can boast such performance. That's the main issue: our technology does not establish a hierarchy of competence [2], where each layer is independently able to solve problems, as in the cell-tissue-organ-organism continuum. We must start digitizing the material, assemble assemblers that can assemble themselves [3].

[1] Dr. Michael Levin: Xenobots, Limb Regeneration, and The Power of Cellular Communication, https://www.youtube.com/watch?v=H_TyON2xWeQ

[2] Michael Levin, What do bodies think about?, https://www.youtube.com/watch?v=CVr1OkDqnmo "Nested Cognition, not Merely Structure" starts at 4:32

[3] Neil Gershenfeld, How to Make Almost Anything, The Digital Fabrication Revolution, http://cba.mit.edu/docs/papers/12.09.FA.pdf

Our bodies actually have a good amount of redundancy.

The cardiac pacemaker (as in the tissue that sets the heart rate) is redundant. There's a primary and a secondary, and both are made of many cells which can take some damage and the entire system will still work.

You don't need full system resets to get security updates. Kexec, live patching, userspace reboot.
> A machine staying up for almost 3 years is irresponsible in this day and age. [...] but you only need to look at the ssh logs of a 5 minutes old machine to realize this is a terrible idea in modern times.

You don't need to reboot a machine to update ssh.

You only need to reboot the machine to update the kernel; for everything else, you just have to restart the corresponding user-space processes (and even PID1 can re-exec itself). Most kernel vulnerabilities are not remotely exploitable, so as long as you can trust your user-space processes (and keep them updated), it should be safe enough.

As I recall, machines made by Tandem Computers, among other highly fault tolerant machines that have regrettably fallen out of fashion, didn't have to reboot even to replace the kernel. They didn't run Linux, tho.
Air gapped machines and kernel live patching both exist.
And how many people use that? Most servers today are not air-gapped.
How many examples will you need before you say "oh ok, I can see some valid concerns."?

I've worked in places where expensive Lab equipment is running off outdated PCs/servers because updates aren't available and they will absolutely stay on for as long as possible.

We're not all silicon valley, things can be expensive and difficult to replace...

I have kernel live patching on my mother's computer because it means she has to know how to do less
Most servers don't do that, but those that do are not crazy
Are you perhaps a Windows user? In the Linux world updates don't necessarily require reboots.
Actually as of late, Linux has been moving towards rebooting for update.

Yeah, you technically can replace on-disk files while services are running.

In practice this can cause trouble if an application wants to read an updated file at the wrong time, and library dependencies can require restarting a lot of stuff.

For ages people would install an update containing a security fix in glibc or libz or something, and keep on running the vulnerable version of the services that use them.

At that point you might as well reboot.

Modern Fedora has a very Windows-like mechanism where you reboot to update. You reboot, the system installs updates, then reboots again.

While Fedora did move towards that, it's not the only way. A lot of systems which require high reliability are built to reload correctly.

At a generic system level, for example, upgrading NixOS will pull new packages and put them next to the current ones, then re-exec where possible. Nginx can replace its master process (SIGUSR2). Telephony software can often re-exec and keep connections open. Etc.
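
For the nginx case, a rough sketch of what the SIGUSR2 handoff looks like when driven from a script. The pid file path is an assumption (it varies by distro and config), and `upgrade_nginx_binary` is just a name picked for illustration. It relies on nginx renaming the old pid file to `<name>.oldbin` once the new master is starting:

```python
import os
import signal
import time
from pathlib import Path


def upgrade_nginx_binary(pid_file="/run/nginx.pid", kill=os.kill, wait=5.0):
    """Ask a running nginx master to start a new master from the on-disk binary.

    Assumption: pid_file is the master's pid file; adjust for your setup.
    Returns the pid of the new master.
    """
    pid_path = Path(pid_file)
    old_pid = int(pid_path.read_text().strip())
    kill(old_pid, signal.SIGUSR2)   # old master starts the new executable

    # The old pid file is renamed to <name>.oldbin; the new master writes a fresh one.
    old_bin = Path(str(pid_path) + ".oldbin")
    deadline = time.monotonic() + wait
    while not (old_bin.exists() and pid_path.exists()):
        if time.monotonic() > deadline:
            raise TimeoutError("new nginx master did not appear")
        time.sleep(0.1)

    kill(old_pid, signal.SIGQUIT)   # gracefully retire the old master and its workers
    return int(pid_path.read_text().strip())
```

A more cautious version would send SIGWINCH to the old master first and only send SIGQUIT after checking that the new one serves traffic, which keeps a rollback path open.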

Outside of desktops it's not that uncommon to do seamless live reloads of the whole system.

I reboot after update just superstitiously.

Also out of superstition, I avoid hibernate -- when I walk away, it's either on and locked or shutdown. (I also did this on Windows; a mixed state just seemed off-puttingly and worryingly complex to me.)

Given what you said, and because I hear hibernation is notoriously buggy on Linux, both superstitions have rewarded me. :D

> Actually as of late, Linux has been moving ...

That's a pretty broad generalisation. Which distros do you mean?

KDE Neon has done this. Before I had to reboot anyway because usually the desktop was full of random crashes if I updated without rebooting.
Ahhh. As a first thought, that sounds like you could have restarted your desktop (eg logout -> login) without needing a reboot.

On a related topic, Ubuntu has an optional package that can be enabled to automatically restart the various systemd components that need it after their dependencies have been upgraded. From memory, that's specifically so people don't have to reboot unless it's really needed.

I don't remember the name of the package off hand though, but someone else here might... :)

I tried the logout/login dance several times, but it didn't always work. A reboot doesn't take much longer anyway so...
On Arch Linux at least, any external hardware device whose kernel module is not already loaded will fail to load after a kernel update
They do. Kernel and libs require it, unless you want to be unsure if your system is still reboot-safe
1042 days ought to be enough for anybody
Spoken like a true AWS user!
> 1042 days ought to be enough for anybody

"640K ought to be enough for anybody."

Not for a server
Not sure if he was sarcastically referencing "640kb should be enough for anything"
Kernel bugs are rare. Most (almost every single) vulnerability can be patched without rebooting.
They're not that rare. Also, there are a lot of other updates that in practice should be followed up with a reboot. For example, any library consumed by systemd (such as openssl) usually requires pid1 to relaunch, and Debian released an openssl update just yesterday. You can run "checkrestart -v" to try to figure out how to restart every affected app, but you'll quickly run into systemd's init process running with the old vulnerable library loaded, and then you might as well just reboot to get a clean "checkrestart -v". Even just relaunching non-pid-1 applications like dbus can quickly create a mess where sshd logins get a delay if you're not careful to also reload everything that depends on it.
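
If you want to see the problem for yourself, here is a sketch of the general idea: on Linux, a process that still maps a library file that has since been replaced shows it in `/proc/<pid>/maps` with a ` (deleted)` suffix. This is not how checkrestart itself is implemented, just an illustration; the function names are made up, and the `.so` filter is a crude heuristic:

```python
from pathlib import Path

DELETED = " (deleted)"


def stale_libraries(maps_text):
    """Return mapped shared objects whose on-disk file was deleted or replaced."""
    stale = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5]
        if path.endswith(DELETED) and ".so" in path:
            stale.add(path[: -len(DELETED)])
    return stale


def processes_needing_restart(proc_root="/proc"):
    """Map pid -> stale libraries for every process we are allowed to inspect."""
    result = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            text = (entry / "maps").read_text(errors="replace")
        except OSError:  # process exited, or permission denied
            continue
        stale = stale_libraries(text)
        if stale:
            result[int(entry.name)] = stale
    return result
```

Run as root, `processes_needing_restart()` will include pid 1 in its result whenever init is one of the processes still holding an old library.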
> For example, any library consumed by systemd (such as openssl) usually requires pid1 to relaunch.

That does not require a reboot, `systemctl daemon-reexec` is enough.

Nice username, can I ask what you saw?