by ibdknox
4851 days ago
We just seem to have different definitions of "fix". Fix, to me, implies the issue goes away. Are 1, 2, and 3 important? Yes. 4 should never have been an issue to begin with. And 5 is a non-solution given that simply adding more lines of execution does not address the root problem. In no way have they solved the actual issue (a poor queuing strategy). And so even if you now know that you're getting awful performance due to queuing and you even try to get a multi-threaded strategy going per their suggestion, you will see the exact same issue at scale. That is not a fix. Their stance on actually implementing a strategy that removes the root issue has been one of silence. Suggesting that "this probably won't be the end of it" isn't useful if you're running a business that relies on Heroku. If that isn't the end of it, then they should be far more communicative about the steps they're taking. Given their blog posts, we have no evidence that further solutions to this problem are being worked on or that they even acknowledge it's something they should fix. So no, I do not agree with you that that is a lot of "fixing".
|
Actually, more threads of execution do solve the problem. The difference from just doubling the number of dynos is that within a single dyno, requests can be routed intelligently. The reason random routing sucks is that request processing times have a fat-tailed distribution: there is a small but still significant chance that a request takes a really long time. If that request is routed to a random single-threaded dyno, then all further requests routed to that dyno have to wait a very long time before they can be processed. If, however, you had multiple threads of execution on the dyno, the other requests would simply go to another thread of execution. So now there would only be blocking if a single dyno gets N really long requests at roughly the same time, where N is the number of concurrent threads the dyno is running. The probability of getting N expensive requests on the same dyno at approximately the same time decreases very fast with increasing N.
Hand waving ahead! Let's say the probability of an expensive request blocking a dyno is p = 2%. Then if you double the number of dynos, the probability of blocking a dyno is now p/2 = 1%. If, however, you have two execution threads on each dyno, the probability of blocking a dyno is now p^2 = 0.04%. If you have 10 execution threads it is p^10, which is about 10^-17: very small indeed.
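To make the hand waving concrete, here is a minimal sketch of that arithmetic in Python. The function names are made up for illustration, and the model assumes that slow requests hit each thread independently, which is exactly the hand-wavy part.

```python
def blocked_with_more_dynos(p, factor):
    """Chance a dyno is blocked after multiplying the dyno count by `factor`.

    Hand-wavy assumption: the same slow requests are spread over
    `factor` times as many dynos, so each one sees p / factor.
    """
    return p / factor


def blocked_with_threads(p, n_threads):
    """Chance a dyno is blocked when it runs `n_threads` threads.

    Hand-wavy assumption: each thread is tied up independently with
    probability p, and the dyno only blocks if all of them are.
    """
    return p ** n_threads


p = 0.02  # 2% chance of an expensive request blocking a dyno

print(f"2x dynos:   {blocked_with_more_dynos(p, 2):.4%}")
print(f"2 threads:  {blocked_with_threads(p, 2):.4%}")
print(f"10 threads: {blocked_with_threads(p, 10):.3e}")
```

Doubling the dynos only halves the probability, while each extra thread multiplies it by another factor of p.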
Here is a paper about it which makes that intuition precise and shows that even N=2 is a massive improvement over N=1: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
The problem is that this only works if each concurrent process of your application doesn't use too much memory, since the available memory on one dyno is quite low. For many applications you can't easily have multiple threads of execution on one dyno. The real solution is to have some form of intelligent routing. As the hand waving and the paper above show, you can make groups of dynos, have the main router route to a random group, and then route requests intelligently within each group. You can take the size of a group to be a small constant, say 10 dynos, so there shouldn't be any scalability problems with this routing approach. If you make the group size small enough, you could even run each group of dynos on a single physical machine, which would make intelligent routing among them even simpler.
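A toy simulation of the grouped routing idea, to show the effect rather than to model Heroku's actual router. Everything here is an assumption made for illustration: the function names, the workload (Poisson arrivals, 98% of requests taking 0.1s and 2% taking 5s), and "intelligent" routing meaning "send the request to the dyno in the group that frees up first".

```python
import random


def make_workload(n_requests, arrival_rate, p_slow, fast, slow, seed=0):
    """Return a list of (arrival_time, service_time) pairs.

    Arrivals are Poisson with `arrival_rate` requests per second; each
    request is slow with probability `p_slow` (a crude fat tail).
    """
    rng = random.Random(seed)
    t = 0.0
    workload = []
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate)
        service = slow if rng.random() < p_slow else fast
        workload.append((t, service))
    return workload


def mean_wait(workload, n_dynos, group_size, seed=1):
    """Average queueing delay for single-threaded dynos.

    The main router picks a random group; within the group the request
    goes to the dyno that becomes free earliest. group_size=1 is plain
    random routing. n_dynos must be a multiple of group_size.
    """
    rng = random.Random(seed)
    free_at = [0.0] * n_dynos  # time at which each dyno is next idle
    n_groups = n_dynos // group_size
    total_wait = 0.0
    for arrival, service in workload:
        g = rng.randrange(n_groups)
        members = range(g * group_size, (g + 1) * group_size)
        dyno = min(members, key=lambda d: free_at[d])
        start = max(arrival, free_at[dyno])
        total_wait += start - arrival
        free_at[dyno] = start + service
    return total_wait / len(workload)


workload = make_workload(20000, arrival_rate=50.0, p_slow=0.02,
                         fast=0.1, slow=5.0)
print("random routing:   ", mean_wait(workload, n_dynos=20, group_size=1))
print("groups of 10:     ", mean_wait(workload, n_dynos=20, group_size=10))
```

Both runs use the same 20 dynos and the same requests; only the routing differs. The main router still makes a purely random choice, so it needs no global state, and the smart part only ever looks at `group_size` dynos.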