by apinstein 4867 days ago
Because to be intelligent, you have to have the router talk to all the dynos to calculate load. Doing that in a performant way can get tricky, especially since people can hit a button and get 100 workers. The bigger the n, the more resources are required to track everything and the more things can go wrong.

It's not an intractable problem, but it's not trivial, affects only a small percentage of customers, and introduces complexity for everyone.

I feel pretty confident that there is a reasonable solution, and as someone who just spent the last 3 weeks building a custom buildpack and a new Heroku app for an auto-scaling worker farm, I am happy to see such a quick, hopeful response.

1 comments

I don't think this only affects a small percentage of customers. Maybe only a small percentage will notice.

This will affect any user with a high standard deviation in their application's response time.

Let's take an extreme case as an example: an application has an average response time of 100ms, but 1% of the responses take 3s. The load is relatively small and only 2 web dynos are running. The admin thinks: We have this occasional slow response, but it should be fine. When one dyno is chewing on the 3s task, the other dyno will pick up the slack. Wrong.

With random routing, when one dyno is chewing on the slow task, 50% of the incoming requests are stacking up in that dyno's queue. The other dyno may be able to easily handle its load, but half of the requests arriving in that window are still getting hit with delays of up to 3s or more.
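To make that concrete, here is a rough simulation sketch, not a model of Heroku's actual router. The numbers are my assumptions: 2 dynos that each handle one request at a time, 5 requests/s arriving at random, 99% of requests taking 0.1s and 1% taking 3s. The function names and the "least_busy" policy (send each request to whichever dyno will be free soonest) are hypothetical, just a stand-in for "intelligent" routing.

```python
import random

def simulate(routing, n_dynos=2, n_requests=100_000, arrival_rate=5.0, seed=1):
    """Return response times (seconds) for single-request-at-a-time dynos."""
    workload = random.Random(seed)      # drives arrivals and service times
    router = random.Random(seed + 1)    # drives random routing decisions only
    free_at = [0.0] * n_dynos           # when each dyno finishes its backlog
    now = 0.0
    latencies = []
    for _ in range(n_requests):
        now += workload.expovariate(arrival_rate)            # random arrivals
        service = 3.0 if workload.random() < 0.01 else 0.1   # 1% slow requests
        if routing == "random":
            d = router.randrange(n_dynos)
        else:  # "least_busy": pick the dyno that frees up soonest
            d = min(range(n_dynos), key=free_at.__getitem__)
        start = max(now, free_at[d])    # wait behind whatever is queued there
        free_at[d] = start + service
        latencies.append(free_at[d] - now)
    return latencies

def fraction_slower_than(latencies, threshold):
    """Share of requests whose response time exceeded `threshold` seconds."""
    return sum(1 for x in latencies if x > threshold) / len(latencies)

for policy in ("random", "least_busy"):
    times = simulate(policy)
    print(policy,
          "mean=%.3fs" % (sum(times) / len(times)),
          "slower than 1s: %.2f%%" % (100 * fraction_slower_than(times, 1.0)))
```

Both policies see exactly the same request stream, so any difference comes from routing alone. Under these assumptions the random router should show noticeably more requests over 1s than the 1% that are genuinely slow, because fast requests get parked behind a slow one.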

This is an extreme example, but this is not a rare issue. As the admin, unless you know about this issue, you will be perplexed by the seemingly random slow response times your users will be reporting. You won't see the problem in your logs, or your New Relic performance reports, but your customers will notice.

As others have pointed out, a major selling point of Heroku is that it is supposed to "just work." These sorts of issues are supposed to be intelligently handled by their super-slick infrastructure. In my opinion, this is a serious issue. The fact that this has been biting users for 3 years now and Heroku is only willing to address the problem after they get major bad press is disheartening.

I have always been impressed by Heroku, especially how they constantly step up, admit their mistakes, and appear to be as transparent as possible about how they will fix their issues. This situation is seriously disappointing.

I am sure this is a very difficult problem to solve at their scale, but this is really what we as customers are paying them to solve.

If you can add even one worker per dyno, things get much better, faster. See http://rapgenius.com/1504221/James-somers-herokus-ugly-secre... for details.

The really big problem there is having dynos that can handle only one request at a time.
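A quick way to see why per-dyno concurrency matters is a sketch like the one below. It keeps random routing but lets each dyno run several workers. Everything here is assumed for illustration: the traffic numbers (5 requests/s, 1% of requests taking 3s, the rest 0.1s), the hypothetical `slow_fraction` helper, and the idea that an extra worker costs nothing in CPU or memory on the dyno, which is optimistic.

```python
import heapq
import random

def slow_fraction(workers_per_dyno, n_dynos=2, n_requests=100_000,
                  arrival_rate=5.0, threshold=1.0, seed=1):
    """Fraction of requests slower than `threshold` s under random routing."""
    workload = random.Random(seed)      # arrivals and service times
    router = random.Random(seed + 1)    # random dyno choice
    # One min-heap per dyno: the time at which each of its workers is free.
    dynos = [[0.0] * workers_per_dyno for _ in range(n_dynos)]
    now = 0.0
    slow = 0
    for _ in range(n_requests):
        now += workload.expovariate(arrival_rate)
        service = 3.0 if workload.random() < 0.01 else 0.1
        workers = dynos[router.randrange(n_dynos)]
        start = max(now, workers[0])    # earliest-free worker on that dyno
        heapq.heapreplace(workers, start + service)
        if start + service - now > threshold:
            slow += 1
    return slow / n_requests

for k in (1, 2, 3):
    print(k, "worker(s) per dyno:",
          "%.2f%% of requests over 1s" % (100 * slow_fraction(k)))
```

With one worker per dyno, a single slow request blocks everything routed to that dyno. With two, a request is only stuck when both workers on its dyno are busy with slow work at once, which is much rarer at this load.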