by macintux 793 days ago
Jim Gray wrote a classic paper about fault tolerance that I often reference when talking about Erlang: Why Do Computers Stop and What Can Be Done About It?

http://jimgray.azurewebsites.net/papers/tandemtr85.7_whydoco...

> In the future, hardware will be even more reliable due to better design, increased levels of integration, and reduced numbers of connectors

I couldn't help laughing at that

Gray wrote an unofficial follow-up report [0], based on Tandem customer data from 1985-1989. In it he mentions that the big hardware improvements (at least for Tandem Computers) were the switch to VLSI logic, hard disks that required no maintenance, and the use of fiber-optic connections.

I still find Tandem NonStop systems interesting, and HPE still sells them, now running on standard x86 servers.

[0] https://jimgray.azurewebsites.net/papers/TandemTR90.1_WhySto...

Better design enabling rowhammer, meltdown, and the like...

But when it comes to failures, I would bet things have improved if you measure failures per operation.

Computers did not fail often 30 years ago. If they failed orders of magnitude more often nowadays, we would definitely notice.

I have absolutely no numbers on reliability, by any metric.