| HN Mirror

I don't think this only affects a small percentage of customers. Maybe only a small percentage will notice.

This will affect any user with a high standard deviation in their application's response time.

Lets take an extreme case as an example: An application that has an average response time of 100ms, however 1% of the responses have a 3s response time. They have relatively small load and they only have 2 web dynos running. The admin thinks: We have this occasional slow response, but it should be fine. When one dyno is chewing on the 3s task, the other dyno will pick up the slack. Wrong.

With random routing, when one dyno is chewing on the slow task, 50% of the incoming requests are stacking up in that dyno's queue. The other dyno may be able to easily handle it's load, but half of your responses are still getting hit with 3s+ delays.

This is an extreme example, but this is not a rare issue. As the admin, unless you know about this issue, you will be perplexed by the seemingly random slow response times your users will be reporting. You won't see the problem in your logs, or your New Relic performance reports, but your customers will notice.

As others have pointed out, a major selling point of Heroku is it is supposed to "just work." These sorts of issues are supposed to be intelligently handled by their super-slick infrastructure. In my opinion, this is a serious issue. The fact that this has been biting users for 3 years now and Heroku is only willing to address the problem after they get major bad press is disheartening.

I have always been impressed by Heroku, especially how they constantly step up, admit their mistakes, and appear to be as transparent as possible about how they will fix their issues. This situation is seriously disappointing.

I am sure this is a very difficult problem to solve at their scale, but this is really what we as customers are paying them to solve.