by wfarr 5019 days ago
At the time of the outage, the status site was seeing upwards of 30,000 requests per minute.

As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led us to add dynos as quickly as we could, because CPU burn appeared (at the time) to be a significant cause of the slowness. This was in part caused by all the dynos repeatedly crashing. That's how we ended up going from 8 dynos to 90.

Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.

2 comments

What prevented you from just caching the status page and then refilling the cache manually every X seconds? I'm sure a status that is a few seconds old wouldn't have been an unreasonable compromise, given the system-wide meltdown.
Or memcache, with one worker dyno dedicated to updating it, cron-like.
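To make the two suggestions above concrete, here is a minimal in-process sketch of the idea: one background refresher re-renders the page every few seconds, and every request just reads the cached copy. This is not how the status site was built. `StatusCache`, `render`, and `interval_seconds` are hypothetical names, and a real multi-dyno setup would need a shared store such as memcache instead of a per-process variable.

```python
import threading


class StatusCache:
    """Serves a pre-rendered page; only the refresher calls the backend."""

    def __init__(self, render, interval_seconds):
        # `render` is a hypothetical expensive call (DB query + templating).
        self._render = render
        self._interval = interval_seconds
        self._lock = threading.Lock()
        # Fill once up front so readers never see an empty cache.
        self._page = render()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _refresh_loop(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            try:
                page = self._render()
            except Exception:
                # Backend is down: keep serving the stale page.
                continue
            with self._lock:
                self._page = page

    def get(self):
        # Request handlers call this; it never touches the backend.
        with self._lock:
            return self._page
```

With this shape, backend load depends on the refresh interval rather than on the request rate, and a failed refresh degrades to a slightly stale page instead of an error.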
30,000 req/minute is 500 qps. That's... just not a lot for a large service.