by cwb71 5069 days ago

The part of this post that really blew my mind:

  We host our status site on Heroku to ensure its availability
  during an outage. However, during our downtime on Tuesday
  our status site experienced some availability issues.

  As traffic to the status site began to ramp up, we increased
  the number of dynos running from 8 to 64 and finally 90.
  This had a negative effect since we were running an old
  development database addon (shared database). The number of
  dynos maxed out the available connections to the database
  causing additional processes to crash.

Ninety dynos for a status page? What was going on there?

2 comments

wfarr 5069 days ago

At the time of the outage, the status site was seeing upwards of 30,000 requests per minute.

As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led to us massively increasing dynos as quickly as we could, as it appeared (at the time) that CPU burn was a significant cause of the slowness. This was in part caused by all the dynos repeatedly crashing. That's how we ended up going from 8 dynos to 90.

Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.
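
A rough sketch of why adding dynos made things worse: each dyno holds its own database connections, so the total grows with the dyno count until it passes the database's limit. The dyno counts (8, 64, 90) come from the post; the pool size and the connection limit below are hypothetical values chosen only for illustration.

```python
def connections_needed(dynos: int, conns_per_dyno: int) -> int:
    """Total database connections if every dyno opens its full pool."""
    return dynos * conns_per_dyno


def over_limit(dynos: int, conns_per_dyno: int, db_limit: int) -> bool:
    """True when the dynos together want more connections than the database allows."""
    return connections_needed(dynos, conns_per_dyno) > db_limit


# Hypothetical numbers: the thread does not state the pool size per dyno
# or the connection limit of the shared development database.
CONNS_PER_DYNO = 1
DB_LIMIT = 20

for dynos in (8, 64, 90):
    wanted = connections_needed(dynos, CONNS_PER_DYNO)
    print(dynos, wanted, over_limit(dynos, CONNS_PER_DYNO, DB_LIMIT))
```

Under these assumed values, 8 dynos fit within the limit while 64 and 90 do not, which matches the pattern described: more dynos, more failed connections, more crashes.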

ashray 5069 days ago

What prevented you from just caching the status page and then refilling the cache manually every X seconds? I'm sure a status that is a few seconds old, given the system-wide meltdown, wouldn't have been an unreasonable compromise.

erichocean 5069 days ago

Or memcache, with one worker dyno dedicated to updating it, cron-like.
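
A minimal sketch of the idea in the two comments above: readers only ever see a pre-rendered copy, and a single cron-like worker refreshes it on an interval. This uses an in-process cache rather than memcache to stay self-contained; `CachedPage`, `run_refresher`, and the `render` callback are hypothetical names, not anything from the GitHub status site.

```python
import threading
from typing import Callable


class CachedPage:
    """Serve a pre-rendered page; only the refresher touches the backend."""

    def __init__(self, render: Callable[[], str]) -> None:
        self._render = render
        self._lock = threading.Lock()
        self._body = render()  # fill once so readers never see an empty cache

    def get(self) -> str:
        """Return the cached body without calling the renderer."""
        with self._lock:
            return self._body

    def refresh(self) -> None:
        """Re-render (the slow part, done outside the lock) and swap in the result."""
        body = self._render()
        with self._lock:
            self._body = body


def run_refresher(page: CachedPage, interval_s: float, stop: threading.Event) -> None:
    """Cron-like loop: refresh every interval_s seconds until stop is set."""
    while not stop.wait(interval_s):
        page.refresh()
```

With this shape, the database sees one query per refresh interval regardless of how many requests hit `get()`. `run_refresher` would typically run in its own thread or worker process.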

adgar 5069 days ago

30,000 req/minute is 500 qps. That's... just not a lot for a large service.
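
The conversion, spelled out as a quick check. The per-dyno figure simply divides by the 90 dynos mentioned in the post and assumes an even spread of traffic, which is an illustration rather than a measurement.

```python
def per_second(requests_per_minute: float) -> float:
    """Convert a per-minute request rate to a per-second rate."""
    return requests_per_minute / 60


total_qps = per_second(30_000)
print(total_qps)  # 500.0

# Assuming traffic were spread evenly across 90 dynos:
print(round(total_qps / 90, 1))  # 5.6
```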

mbell 5069 days ago

Anyone tested S3's static page hosting under heavy load? I would think you could just update the static file as a result of some events fired by your internal monitoring process.
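
One way the "update a static file on events" idea might look, as a sketch only. It renders the page and writes it to a local directory; the upload to S3 is deliberately left out. The component names, `render_status_page`, and `publish` are hypothetical.

```python
import html
import tempfile
from pathlib import Path


def render_status_page(components: dict[str, str]) -> str:
    """Render component states into a small static HTML page."""
    rows = "\n".join(
        f"<li>{html.escape(name)}: {html.escape(state)}</li>"
        for name, state in sorted(components.items())
    )
    return f"<!doctype html>\n<title>Status</title>\n<ul>\n{rows}\n</ul>\n"


def publish(components: dict[str, str], out_dir: Path) -> Path:
    """Write index.html via a temp file so readers never see a partial page.

    Uploading the result to a bucket would be a separate step.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "index.html"
    tmp = out_dir / "index.html.tmp"
    tmp.write_text(render_status_page(components), encoding="utf-8")
    tmp.replace(target)
    return target


# Example event from a (hypothetical) monitoring process.
with tempfile.TemporaryDirectory() as d:
    path = publish({"api": "degraded", "web": "operational"}, Path(d))
    print(path.read_text(encoding="utf-8"))
```

Because the page is regenerated only when monitoring fires an event, serving it needs no application server or database at request time.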

dustym 5069 days ago

We use S3 behind CloudFront with a 1-second max-age to serve The Verge liveblog. It's been nothing but rock solid. We essentially create a static site and push up JSON blobs. See here:

http://product.voxmedia.com/post/25113965826/introducing-syl...

spicyj 5069 days ago

This is really interesting -- thanks for sharing. It seems to me that you could probably have nginx running on a regular box and then CloudFront as a caching CDN to avoid the S3 update delay.

dustym 5068 days ago

Probably could figure that out, yeah. But we didn't want to take any chances given how important it was to get our live blog situation under control.

[edit]

Which is to say, we wanted a rock solid network and to essentially be a drop in a bucket of traffic, even at the insane burst that The Verge live blog gets.

donavanm 5068 days ago

Could you say more about using both Cache-Control and a query string of epoch time? The query string in particular has me puzzled. On its face it seems to decrease your cache hit ratio, with little or no benefit. I'm assuming the epoch time is the client's local time. The clock skew across the client population increases the number of cache keys active at any one time. The incrementing query string also forces a new cache key once per second. Those would force a cache miss and a complete request to S3 even when content has not changed. It's even worse with the skew, as you now force a cache miss per second for each unique epoch time in your client clocks. Without the query string the cache could do a conditional GET for live.json. That would save latency and bytes, as the origin could respond with a 304 instead of the complete 200.
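
To make the cache-key argument concrete, here is a small simulation. It assumes each client appends its own local epoch second to the URL; the exact URL format, the skew range, and the client count are hypothetical. The last function sketches the conditional GET alternative using an ETag comparison.

```python
import random
from typing import Optional


def cache_keys_with_epoch(
    n_clients: int, max_skew_s: int, true_epoch: int, seed: int = 0
) -> set[str]:
    """Distinct cache keys in one real second when each client uses its own clock."""
    rng = random.Random(seed)
    keys = set()
    for _ in range(n_clients):
        skew = rng.randint(-max_skew_s, max_skew_s)  # client clock error, in seconds
        keys.add(f"/live.json?{true_epoch + skew}")
    return keys


def cache_keys_without_epoch() -> set[str]:
    """Without the query string, every client shares one cache key."""
    return {"/live.json"}


def origin_status(current_etag: str, if_none_match: Optional[str]) -> int:
    """Conditional GET: 304 if the cached copy is still current, else a full 200."""
    return 304 if if_none_match == current_etag else 200


with_epoch = cache_keys_with_epoch(n_clients=10_000, max_skew_s=30, true_epoch=1_340_000_000)
print(len(with_epoch), len(cache_keys_without_epoch()))
```

With a skew of up to 30 seconds in either direction, a single real second can map to as many as 61 distinct keys, each of which is a separate miss at the CDN.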

dustym 5068 days ago

Great point. I don't speak for the guys that made the decision to append the timestamp to the query, but I assume our concern is intermediate network caches that don't honor low TTLs. Though I don't know how well-founded that is, we won't ever have to deal with the issue if we take control of it with the URL string.

It'd be interesting to see how wide the key space is due to clock skew. I suppose we could specify some number and consider it a global counter that is incremented every second; then when someone comes in for the first time they can be synced with the global counter. That counter is used to ensure a fresh CloudFront hit.
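
A sketch of how the global-counter idea might work: on first load the client records the difference between the server's counter and its own clock, then applies that offset to every later request. The function names are hypothetical, and network latency during the sync is ignored.

```python
def sync_offset(server_counter: int, client_epoch: int) -> int:
    """Offset a client stores on first load to map its clock onto the server counter."""
    return server_counter - client_epoch


def cache_buster(client_epoch_now: int, offset: int) -> int:
    """Query-string value a synced client would send."""
    return client_epoch_now + offset


server_counter = 1_000_000
skews = [-12, 0, 7]  # hypothetical client clock errors, in seconds

# Each client syncs once, reading its own (skewed) clock.
offsets = [sync_offset(server_counter, server_counter + s) for s in skews]

# Five seconds later every client computes the same key despite the skew.
keys = {cache_buster(server_counter + s + 5, off) for s, off in zip(skews, offsets)}
print(keys)  # {1000005}
```

This collapses the key space back to one key per second, though it still forces a new key every second even when the content has not changed.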

I think at the end of the day, these issues haven't been a huge concern for a one-month emergency project, but they are good points.

WestCoastJustin 5069 days ago

S3 is great for static content. I was taking the AWS ops course and the instructor mentioned some very large organizations redirect their site to S3 when under DDOS so they can remain on-line. In fact, he said that AWS recommended this solution to them?! Can you fathom someone who is under DDOS, and you tell them, hey, just redirect that our way ;)

fierarul 5068 days ago

You pay for the bandwidth on AWS. Of course they would be glad to redirect a DDOS their way. It's pure gold for AWS.

moe 5069 days ago

"Heavy load"?

30 kRPM is 500 hits/sec. Nginx will serve >2000/sec from an m1.small. For S3 that is about the equivalent of a mosquito fart.

biot 5069 days ago

Use Jekyll and push the site to S3:

https://github.com/mojombo/jekyll/wiki

https://github.com/laurilehmijoki/jekyll-s3#readme
