by ipsocannibal 2033 days ago
So the cause of the outage boils down to two things: not having a metric on total file descriptors with an alarm when usage gets within 10% of the max, and a faulty scaling plan that should have said "for every N backend hosts we add, we must add X frontend hosts". One metric and a couple of lines in a wiki could have saved Amazon what is probably millions in outage-related costs. One wonders if Amazon retail will start hedging its bets and go multicloud to prevent impacts on retail customers from AWS LSEs.
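For what it's worth, both checks are small enough to sketch. This is only an illustration, not how Amazon does it: the function names (`fd_usage`, `should_alarm`, `frontends_needed`) are made up, it assumes the limit in question is the per-process soft `RLIMIT_NOFILE` on a Unix host, and N and X are whatever the capacity plan says.

```python
import math
import os
import resource  # Unix-only standard library module


def fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Assumption: the limit that matters is the per-process soft limit.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd exists on Linux; /dev/fd is the fallback on macOS/BSD.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir)), soft_limit


def should_alarm(used, limit, headroom=0.10):
    """True when usage is within `headroom` (default 10%) of the limit."""
    if limit == resource.RLIM_INFINITY:
        return False  # no finite limit, nothing to alarm on
    return (limit - used) <= limit * headroom


def frontends_needed(backends, n, x):
    """Scaling rule: for every n backend hosts, provision x frontend hosts."""
    return math.ceil(backends / n) * x


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"fds: {used}/{limit}, alarm: {should_alarm(used, limit)}")
```

In practice `fd_usage()` would feed whatever metrics system is already in place, and `should_alarm` would be the alarm threshold configured there rather than code on the host.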