Hacker News new | ask | show | jobs
by nomilk 1743 days ago
It's a problem for any application on Heroku that uses more than a single web dyno (i.e. any software other than tiny applications), and certainly not limited to applications with long-running requests.

Consider Genius's checkout analogy: busy markets are efficient because customers sort themselves into the shortest queues. If customers were randomly allocated queues, some customers will wait in lengthy queues while others queues remain empty!

Critically, the same logic holds whether customers have lots of items (long-running requests), or very few items - one is slightly worse, but customers are still waiting unnecessarily while other queues are free to be used (but aren't).

So it's a problem even for applications with no long-running requests. Suppose a user's request falls in a queue with 10 users being served first, while other queues are idle, also assume requests are reasonably fast - 3000ms - then you still have to wait 33 seconds (!!) for the "3000ms" request to be served (!!) - I think that will error because heroku times out requests at 30000ms. This is a disaster for UX.

Someone made a simulation to show how incredibly wasteful it is: https://gist.github.com/toddwschneider/4947326

Note that while it would be great for it to be improved, I in no way demand/expect that, I just don't think it's fair that Genius aren't given credit for their analysis - analysis that Heroku ought to have provided. It shouldn't have been Genius or other customers to expose that Heroku had misguided them, even if it happened accidentally and without intent.

1 comments

I agree on most of this. My biggest disappointment is Heroku not implementing a reasonable form of autoscaling to mitigate the worst aspects of this if it occurs. Though I’d still contend that for a site like Rap Genius at the time, 3000ms server response times were still wholly unsatisfactory and they needed to get it under 100ms. Because the likelihood of this occurring compounds significantly in proportion to request time _and_ request volume. They knew what the required solution was.

The original “intelligent” routing was possible because the router effectively had a single shared state to track all these things which was also a single point of failure. And because the state of the art of web servers for Ruby back when it was implemented was Mongrel and it was single threaded.

Add in support for nodejs (and other multi-threaded languages to soon follow) and suddenly each dyno can process multiple request concurrently and the request accounting isn’t as valuable. Make the router more fault tolerant and the overhead in trying to manage that accounting at scale becomes almost impossible without adding significant overhead to every customer request and degrading performance for everyone.

I bristle when people seem to think there was some nefarious revenue incentive from Heroku here. They were well reasoned platform changes to improve availability and increase performance for more efficient languages. In the process the marketing got disconnected from the implementation.