by mmartinson 2280 days ago
The sort of crash looping that took down their nodes is one of the parts of OTP that I’ve found to be both quite unintuitive and dangerous, for exactly the reasons described. There are controls for max retry attempts within a time window, but no obvious way from what I can tell to require a supervisor to halt rather than propagate a crash loop in a way that is easily auditable. This has caused me bad failures as well, and I’d be curious to hear what others are doing here. It’s not that it’s a hard problem to solve, but that it’s an easy thing to overlook.
1 comment

Max restarts is a big sharp edge. I think anybody running Erlang in production must have hit it, probably a couple times. For things that I expect to fail regularly, but have no business bringing the system down, I reflexively sleep in the init, so it would only trigger max restarts if I really messed up. If it's consistently failing, it'll log enough to be visible, but it's not like this is an ideal way to handle it.
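
To make the arithmetic behind the sleep-in-init trick concrete, here is a rough Python model of a restart limit. It is not OTP code and only approximates the rule "more than N restarts inside a sliding time window makes the supervisor give up". The function name, the parameters, and the limit of 3 restarts in 5 seconds are all illustrative assumptions.

```python
from collections import deque


def time_until_supervisor_gives_up(crash_after, init_sleep,
                                   max_restarts, max_seconds, horizon):
    """Simplified model of a supervisor restart limit (hypothetical helper).

    The child sleeps `init_sleep` seconds in its init, runs for
    `crash_after` seconds, then crashes and is restarted immediately.
    Returns the simulated time at which the restart limit is exceeded,
    or None if the supervisor is still alive after `horizon` seconds.
    """
    recent = deque()  # timestamps of restarts inside the sliding window
    now = 0.0
    while now < horizon:
        now += init_sleep + crash_after  # one full start/crash cycle
        recent.append(now)
        # Forget restarts that have fallen out of the window.
        while recent and recent[0] <= now - max_seconds:
            recent.popleft()
        if len(recent) > max_restarts:
            return now  # limit exceeded: the supervisor itself would exit
    return None


# A child that crashes almost instantly, with and without a sleep in init.
no_sleep = time_until_supervisor_gives_up(0.01, 0.0, 3, 5.0, 60.0)
with_sleep = time_until_supervisor_gives_up(0.01, 2.0, 3, 5.0, 60.0)
print("no sleep:", no_sleep)
print("2s sleep in init:", with_sleep)
```

In this model the child without a sleep trips the limit within a fraction of a second, while the one that sleeps 2 seconds in init never fits more than 3 restarts into any 5 second window and so keeps failing (and logging) without taking the supervisor down.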