by Slavius 3136 days ago
I can't think of a single useful piece of software nowadays that is exposed to the public and can't run in an active-active load-balanced or clustered scenario. If your kernel/system/userland app misbehaves, it simply needs to be shut down, reported and examined. It might have been some random memory block the last time your app made a buffer overflow, but it could just as well be the stack pointer next time...
3 comments

Remember, we're not necessarily just talking about servers here; every single hospital has mission-critical client machines that cannot go down, and obviously those aren't load balanced or clustered. (Though mostly they seem to be running Windows.)
For safety-critical systems, resetting on a fault is very much factored into the worst-case response time and expected behaviour.

PANIC on fault is exactly what you design into the systems.

So why isn't the world running on C64s?
The world does run on microcontrollers that have roughly the same processing capability as a C64...

The vast majority of processors sold are not i5 or i7 level but microcontrollers.

Web browsers are, for practical purposes, exposed to the public. Linux doesn't run only on servers.
So what happens when your browser crashes? I experience that on a regular basis. I'd rather have my browser crash or be killed instead of slowly overwriting my filesystem buffers or corrupting my stack pointer... Other than that, browsers are multi-threaded, multi-process applications. Usually only a single tab or a plugin crashes unless the core browser process is affected. Most users would accept the trade off between crashed browser and infected/corrupted system.
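To make the "only a single tab or a plugin crashes" point concrete, here is a minimal sketch of process isolation. It is not how any real browser is implemented; `run_isolated`, `render_tabs` and the tab names are hypothetical, and each "tab" is just a snippet run in its own interpreter process so that a hard crash stays contained.

```python
import subprocess
import sys


def run_isolated(code, timeout=10):
    """Run a snippet in its own interpreter process and return its exit status."""
    completed = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return completed.returncode


def render_tabs(tabs):
    """Run every 'tab' in isolation; a crash in one does not stop the others."""
    report = {}
    for name, code in tabs.items():
        status = run_isolated(code)
        report[name] = "ok" if status == 0 else f"crashed (exit status {status})"
    return report


if __name__ == "__main__":
    # Hypothetical tabs: the middle one aborts hard, the others still run.
    tabs = {
        "news": "print('rendered news')",
        "buggy plugin": "import os; os.abort()",
        "mail": "print('rendered mail')",
    }
    for name, outcome in render_tabs(tabs).items():
        print(f"{name}: {outcome}")
```

The parent only ever sees an exit status, so whatever memory the crashed child scribbled over dies with that process.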
> Most users would accept the trade off between crashed browser and infected/corrupted system.

Most users are using computing devices as a means of getting stuff done. They don't want to spend any energy thinking about how their software works; they want their devices to be invisible, something they use to run their Apps uninterrupted. The trade-off is whether to let Apps continue running vs hard crashing and taking down all the work they've done and all the mental energy and focus invested up to that point. If their Apps frequently crash, most users aren't thinking, "well, I'm super glad the hours I spent on this paper are now lost, and the phone calls to my loved ones or the movie I'm watching were abruptly terminated, because someone's policy of hard crashing when a bug is found has been triggered." Their preferences and purchasing power are going to go towards non-user-hostile devices that they perceive provide the best experience for using their preferred Apps, without any prerequisite knowledge of OS internals.

There's not a single computing device that frequently crashes as a result of security hardening that will be able to retain any meaningful market share. Users are never going to tolerate anything that requires extra effort on their part to research and manually apply whatever is needed to get their device running without crashing.

Apps are supposed to keep their state, either by saving your work regularly to persistent media or by keeping your data off-client. We're living in the 21st century, in the cloud era, FFS.
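For what "saving your work regularly to persistent media" can look like, here is a minimal sketch of a crash-safe autosave. It assumes the state fits in one JSON file; `save_atomically` and `load_state` are hypothetical names. The idea is to write to a temporary file and then rename it over the old one, so a crash mid-save leaves either the previous version or the new one, never a half-written file.

```python
import json
import os
import tempfile


def save_atomically(path, state):
    """Write state to path; a crash leaves the old file or the new one intact."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in, replacing the old
    except BaseException:
        # The save failed: discard the partial temp file, keep the old save.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_state(path, default=None):
    """Return the last successfully saved state, or default if none exists."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Calling `save_atomically` from a timer every few minutes bounds the loss from a hard crash to the interval since the last save.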

Keep running your app although integrity corruption within the application happened is putting user data at risk. IMHO an application that corrupts a 3 days long presentation file on save is, to every user, more frustrating than one that crashes due to an error and leaves you with 5 minutes of unsaved changes lost.

Microsoft have invented "Application Recovery and Restart" exactly for this purpose.

> Keep running your app although integrity corruption within the application happened is putting user data at risk.

If user data is continually backed up to a remote site, it's not going to be at risk from a local bug, is it? Bugs exist in all software. Users are going to be more visibly frustrated by their Apps frequently crashing than by the extremely unlikely scenario where a detected bug corrupts their "3 days long presentation". They're going to be very unhappy if the cause of their frequent data loss was a user-hostile setting to hard crash on the first detectable bug.

> Microsoft have invented "Application Recovery and Restart" exactly for this purpose.

From Microsoft website:

> An application can use Application Recovery and Restart (ARR) to save data and state information before the application exits due to an unhandled exception or when the application stops responding.

- https://msdn.microsoft.com/en-us/library/windows/desktop/cc9...

i.e. restarting Apps due to "unhandled exception or when the application stops responding", in which case the App is in an unusable state and ARR kicks in to try to auto-recover it with minimal user disruption. The focus is on providing a good UX, not a miserable crash-prone experience where users use their devices in fear that at any time anything they're working on can be terminated abruptly without warning.

You clearly have a limited view of application bugs. Let me elaborate a bit on bugs that cause no crash yet produce dissatisfaction and UX frustration much, much worse than a simple error message along the lines of: "OS has terminated application X because it has performed an illegal operation."

- Data corruption - reading or writing corrupted data - files cannot be read, saved files get corrupted, API calls from/to external applications/systems fail or pass incorrect data
- Rendering problems - corrupted images, incorrect colors, improper content encoding, visual stuttering, audio deformation, audio skipping
- Input/output lags - unregistered keystrokes, missed actions and responses to external events, mouse stuttering and misbehavior
- Improper operation - inconsistent results - repeated rendering yields different results (HTML), formula/calculation results are inconsistent (Excel, DWH)
- Access violation - access gained to invalid or protected areas - unprivileged access, license violations, access to areas protected by AAA, data theft (SQL injection, database dumps)

and others. If I found out that the application I'm using (a web browser) had allowed a hacker to steal data he would not otherwise have had access to, I would be more pissed off than if it had crashed and I found an error about it in the system log.

Meta: Who flagged this comment, and why? What rule exactly does Slavius break here?

On topic: I can't recall the details now, but I once read a paper about a system which had no shutdown procedure at all; the only way to exit it was to crash it somehow or just shut down the computer. The system made sure to save everything often enough, and made sure to store the data in ways which allowed for restoring possibly corrupted parts of it on the next startup. This design produced a very resilient architecture which worked well for that use case.

The paper was from the '80s or '90s, so it's not like we need to be in the 21st century to design that way. I'll try searching for the paper later.
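The save-often, repair-on-startup idea can be sketched in a few lines. This is not the system from that paper (which isn't identified here), just an illustration under stated assumptions: state is an append-only log of JSON records, each line carries a CRC32 of its payload, and `append_record` / `recover` are hypothetical names. On startup, any line whose checksum doesn't match (for example a write torn by a crash) is skipped instead of trusted.

```python
import json
import os
import zlib


def append_record(path, record):
    """Append one record with a CRC32 prefix and force it to disk."""
    payload = json.dumps(record, sort_keys=True)
    checksum = zlib.crc32(payload.encode("utf-8"))
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{checksum:08x} {payload}\n")
        log.flush()
        os.fsync(log.fileno())


def recover(path):
    """Startup path: return (valid records, number of damaged lines skipped)."""
    good, dropped = [], 0
    try:
        with open(path, encoding="utf-8", errors="replace") as log:
            for line in log:
                checksum, _, payload = line.rstrip("\n").partition(" ")
                try:
                    if int(checksum, 16) != zlib.crc32(payload.encode("utf-8")):
                        raise ValueError("checksum mismatch")
                    good.append(json.loads(payload))
                except ValueError:
                    # Torn or corrupted line: drop it rather than trust it.
                    dropped += 1
    except FileNotFoundError:
        pass  # first start: nothing to recover
    return good, dropped
```

With this layout there is no separate "clean shutdown" path: killing the process at any point costs at most the record being written, and every start goes through `recover`.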

You might be thinking of KeyKOS, and of the anecdote which can be found at https://lists.inf.ethz.ch/pipermail/oberon/2010/005734.html (it should also be at the EROS homepage, but it's down for me at the moment).

See also: "Crash-only software" https://lwn.net/Articles/191059/

The flagger was probably uncomfortable with "FFS". After all, colorful expression is bad for HN. b^)

What you're talking about seems like crash-only with Erlang/OTP.

> or corrupting my stack pointer...

in that case, it will crash with a SIGSEGV sooner or later anyway

...or is being remotely exploited and it silently succeeds. Who wants that?
That is very unlikely. Crashing would happen 100% of the time though. Most people want that trade-off (meaning: if their browser crashed, they would switch to another one, even if it was less secure).
Stack pointer manipulation is the entry point for an extremely large subset of security issues.
Corrupting SP is part of almost every exploit, and I can guarantee you that it is very likely to cause harm on your system. Try pulling the Metasploit Git repo to get some idea of the thousands of payloads that corrupt SP without crashing the host...
Never had a single problem take down all of your instances at once, eh?