| HN Mirror


by cpitman 1513 days ago

Great post. This one always brings a smile to my face:

> Every component is crash-only

I was part of the team that developed a distributed, five-nines control system for an industry where downtime costs millions per minute and triggers a federal investigation if it lasts long enough. On top of that, the industry is made up of competitors that explicitly distrust each other, so all components had to be truly distributed, with no central coordination for anything.

Given the requirements, we decided to explicitly adopt a crash-only approach. Between idempotent operations, horizontal scaling, and fast restart times, we could keep failing components from impacting SLAs (and we had tests to verify it).
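To make the idea concrete, here is a minimal Python sketch of how idempotent operations and restart-as-recovery fit together. It is not the system described above: the names `apply_op`, `supervise`, and `Crash`, the JSON state file, and the single-process retry loop are all hypothetical, and a raised exception stands in for a real process exit.

```python
import json
import os
import tempfile


class Crash(Exception):
    """Stands in for an abrupt process exit in this sketch."""


def load(path):
    """Recovery is the same code path as startup: read the last persisted state."""
    if not os.path.exists(path):
        return {"balance": 0, "applied": []}
    with open(path) as f:
        return json.load(f)


def save(path, state):
    """Write to a temp file, then rename atomically, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def apply_op(path, op_id, amount, crash_after_save=False):
    """Idempotent: an op_id that was already applied is a no-op."""
    state = load(path)
    if op_id not in state["applied"]:
        state["balance"] += amount
        state["applied"].append(op_id)
        save(path, state)
    if crash_after_save:
        raise Crash(op_id)  # die after persisting but before acknowledging
    return state["balance"]


def supervise(path, ops, crash_on=()):
    """Retry each operation until it is acknowledged; a crash just means 'start again'."""
    crash_on = set(crash_on)
    for op_id, amount in ops:
        while True:
            try:
                apply_op(path, op_id, amount, crash_after_save=op_id in crash_on)
                break
            except Crash:
                crash_on.discard(op_id)  # assumption: the restarted component succeeds
    return load(path)["balance"]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.json")
        # Operation "b" crashes after its effect is persisted, then is retried.
        return supervise(path, [("a", 10), ("b", 5), ("c", -3)], crash_on=["b"])
```

Because "b" is recorded by ID before the simulated crash, the retry does not apply it twice, so `demo()` ends with a balance of 12 rather than 17.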

Once it got out into the field (which, because of how risk-averse this industry is, took years), it turned out they really did not like software crashing. They interpreted crashing as bad quality, and no amount of "we do it on purpose to ensure correctness" was going to make them happy.

5 comments

WJW 1513 days ago

Rebrand it as "fault tolerant" and/or "adverse environment certified" and you should be good to go. That's how they do it in the military sector at least.


__alexs 1513 days ago

It's not a crash, it's a runtime state rollback.


caffeine 1513 days ago

> They interpreted crashing as bad quality

The solution here is to rebrand it with some vague euphemism:

“Ah yes the component underwent a state calibration”


fh973 1513 days ago

The term you're looking for is "software rejuvenation".

Jokes aside, there is even a body of research papers around this subject, if you need some backing.


quickthrower2 1513 days ago

I guess you were working for a stock exchange
