Hacker News
by kennystone 5125 days ago
Quite a few Erlang gotchas in those notes. Fault-tolerant systems are really hard to design even when you know what you're doing and are using the best language for it (Erlang). Erlang aside, it seems the higher-level architecture may need a rethink if one bad record can bring down the whole thing.
1 comments

It looks like the error recovery code wasn't well tested. Mind you, error recovery code in distributed systems is some of the hardest code to test effectively.

The thundering herd of recovery is especially difficult to cope with: your error recovery code can work just fine for normal outages but then fail completely when faced with just a few more components going dark.
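
To make that cliff concrete, here is a toy simulation, not a model of the system in the notes. Everything in it is an assumption for illustration: a recovery service with a fixed `capacity` per tick shared equally among all recovering components, a fixed amount of `work` per recovery, and a `timeout` after which an unfinished recovery starts again from scratch.

```python
def simulate_recovery(failed, capacity=10.0, work=1.0, timeout=3, max_ticks=50):
    """Return how many of `failed` components finish recovering.

    Hypothetical model: each tick the service splits `capacity` evenly
    across every component still recovering. A recovery that has not
    accumulated `work` units within `timeout` ticks is abandoned and
    restarted from zero, so the load never shrinks.
    """
    progress = [0.0] * failed   # work completed in the current attempt
    age = [0] * failed          # ticks spent on the current attempt
    active = set(range(failed))

    for _ in range(max_ticks):
        if not active:
            break
        share = capacity / len(active)  # equal share for this tick
        for i in list(active):
            progress[i] += share
            age[i] += 1
            if progress[i] >= work:
                active.discard(i)       # this component has recovered
            elif age[i] >= timeout:
                progress[i] = 0.0       # timed out: retry from scratch
                age[i] = 0

    return failed - len(active)


print(simulate_recovery(29))  # 29: everyone recovers on the third tick
print(simulate_recovery(31))  # 0: every attempt times out, forever
```

With these made-up numbers the service copes with up to 30 simultaneous recoveries. Two more components going dark does not slow recovery down a little; it stops recovery entirely, because every attempt times out just short of finishing and the retries keep the load pinned at its peak.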