
by wfarr 5067 days ago

At the time of the outage, the status site was seeing upwards of 30,000 requests per minute.

As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led us to add dynos as quickly as we could, because at the time CPU burn appeared to be a significant cause of the slowness. This was in part caused by all the dynos repeatedly crashing. That's how we ended up going from 8 dynos to 90.

Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.

2 comments

ashray 5067 days ago

What prevented you from just caching the status page and then refilling the cache manually every X seconds? I'm sure a status that is a few seconds old, given the system-wide meltdown, wouldn't have been an unreasonable compromise?


erichocean 5067 days ago

Or memcache, with one worker dyno dedicated to updating it, cron-like.
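For illustration only, here is a minimal in-process sketch of the caching idea suggested in this thread: render the status page at most once every few seconds and serve the stored copy to everyone else. The names `TimedPageCache`, `render_status_page`, and the 5-second TTL are hypothetical and not taken from the actual status site; a real deployment along the lines described above would more likely keep the rendered page in memcache and have a single worker refresh it.

```python
import time
from typing import Callable, Optional


class TimedPageCache:
    """Serve a cached page and re-render it at most once per TTL window."""

    def __init__(
        self,
        render: Callable[[], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render          # expensive function (e.g. hits the database)
        self._ttl = ttl_seconds        # how stale a served page is allowed to be
        self._clock = clock            # injectable so the cache can be tested
        self._body: Optional[str] = None
        self._rendered_at = 0.0

    def get(self) -> str:
        now = self._clock()
        expired = now - self._rendered_at >= self._ttl
        if self._body is None or expired:
            # Only this branch touches the backend; all other calls are cheap.
            self._body = self._render()
            self._rendered_at = now
        return self._body


def render_status_page() -> str:
    # Hypothetical stand-in for the real, database-backed rendering.
    return "All systems operational"


if __name__ == "__main__":
    cache = TimedPageCache(render_status_page, ttl_seconds=5.0)
    print(cache.get())  # first call renders
    print(cache.get())  # second call is served from the cache
```

With a TTL of a few seconds, the backend sees roughly one render per TTL window per process regardless of how many requests arrive, at the cost of readers seeing a status that may be up to that many seconds old.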


adgar 5067 days ago

30,000 req/minute is 500 qps. That's... just not a lot for a large service.
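The conversion can be checked with a few lines of Python. The per-dyno figures are only a rough sketch: they assume traffic was spread evenly across dynos, which the thread does not state, and the helper names are made up for this example.

```python
def per_second(requests_per_minute: float) -> float:
    """Convert a per-minute request rate to requests per second."""
    return requests_per_minute / 60


def per_dyno(requests_per_minute: float, dynos: int) -> float:
    """Requests per second per dyno, assuming a perfectly even spread."""
    return per_second(requests_per_minute) / dynos


if __name__ == "__main__":
    print(per_second(30_000))              # 500.0
    print(per_dyno(30_000, 8))             # 62.5
    print(round(per_dyno(30_000, 90), 2))  # 5.56
```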
