I'm sorry, but Tom Lehman sounds like a real dick to me in this interview. Heroku fucked up royally, sure, but why does RapGenius have to keep bashing them even after they started fixing things?
What did they fix? They've sidestepped any real solution to the root problem. As such, the only thing they've "fixed" is New Relic, by making it report what is actually happening.
2. They wrote several blog posts explaining what happened and what is going to happen now (fixing) and in the future (more fixing)
3. They fixed their documentation
4. They helped a third party service to adapt their offering to better help their customers (NewRelic)
5. They offered their advice for better solutions for affected customers (Unicorn)
This sounds a lot like fixing to me.
And judging from what they have done until now, this probably won't be the end of it. So why not just talk to them directly and see if it's enough for you - and if not, just go somewhere else?
We just seem to have different definitions of "fix". Fix, to me, implies the issue goes away. Are 1, 2, and 3 important? Yes. 4 should never have been an issue to begin with. And 5 is a non-solution given that simply adding more lines of execution does not address the root problem.
In no way have they solved the actual issue (a poor queuing strategy). And so even if you now know that you're getting awful performance due to queuing and you even try to get a multi-threaded strategy going per their suggestion, you will see the exact same issue at scale. That is not a fix.
Their stance on actually implementing a strategy that removes the root issue has been one of silence. Suggesting that "this probably won't be the end of it" isn't useful if you're running a business that relies on Heroku. If that isn't the end of it, then they should be far more communicative about the steps they're taking. Given their blog posts, we have no evidence that further solutions to this problem are being worked on or that they even acknowledge it's something they should fix.
So no, I do not agree with you that that is a lot of "fixing".
> And 5 is a non-solution given that simply adding more lines of execution does not address the root problem.
Actually, more threads of execution do solve the problem. The difference from just doubling the number of dynos is that on a single dyno requests can be routed intelligently. The reason random routing sucks is that request processing times have a fat-tailed distribution: there is a small but still significant chance that a request takes really long. If that request is routed to a random single-threaded dyno, then all further requests routed to that dyno have to wait very long before they can be processed. If, however, you had multiple threads of execution on the dyno, the other requests would simply go to another thread of execution. So now there would only be blocking if a single dyno gets N really long requests at roughly the same time, where N is the number of concurrent threads the dyno is running. The probability of getting N expensive requests on the same dyno at approximately the same time decreases very fast with increasing N.
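To make the argument above concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: Poisson arrivals at 100 requests per second, 2% of requests taking 3 s and the rest 50 ms, and a made-up `mean_wait` helper. It compares random routing over 20 single-worker dynos with random routing over 10 dynos running 2 workers each, so the total number of workers is the same in both cases.

```python
import heapq
import random


def mean_wait(dynos: int, workers: int, n_requests: int = 20000,
              arrival_rate: float = 100.0, seed: int = 1) -> float:
    """Average queueing delay (seconds) under random routing.

    Each dyno keeps a min-heap of the times at which its workers become
    free; a request routed to a dyno starts on whichever worker frees up
    first, so requests on one dyno are served in arrival order.
    """
    rng = random.Random(seed)
    free_at = [[0.0] * workers for _ in range(dynos)]
    now = 0.0
    total_wait = 0.0
    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)      # Poisson arrivals
        # Fat tail (illustrative numbers): 2% of requests take 3 s.
        service = 3.0 if rng.random() < 0.02 else 0.05
        heap = free_at[rng.randrange(dynos)]      # router picks a random dyno
        start = max(now, heap[0])                 # wait for earliest-free worker
        total_wait += start - now
        heapq.heapreplace(heap, start + service)  # that worker is busy until then
    return total_wait / n_requests


single = mean_wait(dynos=20, workers=1)
pooled = mean_wait(dynos=10, workers=2)
print(f"20 dynos x 1 worker : {single:.3f} s average wait")
print(f"10 dynos x 2 workers: {pooled:.3f} s average wait")
```

With these made-up numbers the pooled setup should show a clearly lower average wait, which is the whole point: a slow request only hurts the requests behind it when every worker on that dyno is stuck at once.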
Hand waving ahead! Let's say the probability of an expensive request blocking a dyno is p = 2%. Then if you double the number of dynos, the probability of blocking a dyno is now p/2 = 1%. If, however, you have two execution threads on each dyno, the probability of blocking a dyno is now p^2 = 0.04%. If you have 10 execution threads it is p^10, which is very small indeed.
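The same hand waving as a few lines of arithmetic, assuming (as the estimate does) that each worker on a dyno is independently stuck on a slow request with probability p. The function name is just for illustration.

```python
def all_workers_blocked(p: float, workers: int) -> float:
    """Probability that every worker on one dyno is stuck at the same time,
    assuming each is independently stuck with probability p."""
    return p ** workers


p = 0.02                                   # 2% chance a worker is stuck
doubled_dynos = p / 2                      # same traffic spread over twice the dynos
two_workers = all_workers_blocked(p, 2)    # p^2
ten_workers = all_workers_blocked(p, 10)   # p^10

print(f"double the dynos   : {doubled_dynos:.4%}")
print(f"2 workers per dyno : {two_workers:.4%}")
print(f"10 workers per dyno: {ten_workers:.3e}")
```

Doubling the dynos only halves the problem, while each extra worker per dyno multiplies the blocking probability by another factor of p.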
The problem is that this only works if each concurrent process of your application doesn't use too much memory, since the available memory on one dyno is quite low. For many applications you can't easily have multiple threads of execution on one dyno. The real solution is to have some form of intelligent routing. As the hand waving and the paper above shows, you can make groups of dynos, and then the main router routes to a random group, and within each group requests are routed intelligently. You can take the size of a group to be a small constant, say 10 dynos. So there shouldn't be any scalability problems with this routing approach. If you take the group size small enough, you could even run each group of dynos on a single physical machine, which would make intelligent routing among them even simpler.
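A minimal sketch of that two-level idea, to show how little state each group needs. This is not how Heroku's router works; `GroupedRouter`, its methods, and the in-flight counters are all hypothetical names for illustration. The top level picks a random group, and only within that group does the router look at load.

```python
import random


class GroupedRouter:
    """Two-level routing: random group, then least-busy dyno in that group."""

    def __init__(self, n_dynos: int, group_size: int = 10, seed=None):
        self.rng = random.Random(seed)
        self.in_flight = [0] * n_dynos        # requests currently on each dyno
        ids = list(range(n_dynos))
        self.groups = [ids[i:i + group_size]
                       for i in range(0, n_dynos, group_size)]

    def route(self) -> int:
        group = self.rng.choice(self.groups)  # cheap, stateless top level
        # "Intelligent" step: only needs load info for one small group.
        dyno = min(group, key=lambda d: self.in_flight[d])
        self.in_flight[dyno] += 1
        return dyno

    def finish(self, dyno: int) -> None:
        """Call when a dyno has finished handling a request."""
        self.in_flight[dyno] -= 1


router = GroupedRouter(n_dynos=30, group_size=10, seed=0)
for _ in range(300):
    router.route()
print(router.in_flight)
```

Because each group tracks at most `group_size` counters, the cost of the intelligent step stays constant no matter how many dynos the app runs in total.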
This post should be stickied at the top of every Heroku queueing thread. People keep acting like the "intelligent routing" system is trivial to build and has no overhead, both of which are patently false. It's clear that they can't go back to the old method with their newer (since 2011) architecture, so the solution is for apps to fix their own performance issues.
What I said pretty much implies the opposite, so you may want to retract your endorsement ;-) There are various solutions to this problem but almost all involve some action on Heroku's part.
I never expected them to completely rebuild their service because some customers (a very small minority, I assume) aren't totally happy and satisfied with their product. That clearly sucks for the affected people.
It's a reason for them to leave the product and platform and go somewhere else, where the problem is not an integral part of the product. But it's not a reason to be a dick.
> 4. They helped a third party service to adapt their offering to better help their customers (NewRelic)
Unless I misunderstand the situation, NewRelic's Heroku reporting isn't some one-sided third-party service but rather something that at least seems to be jointly produced by Heroku and NewRelic.
NewRelic can't report something that isn't offered up and it would seem to me that Heroku needs to deliberately expose metrics to the NewRelic plugin for it to be able to pick them up.
Since it seems these queue times weren't reported anywhere developer-accessible, it also stands to reason that they weren't exposed to NewRelic.
So no, Heroku didn't fix some third-party service, they fixed their own service (in this regard).
I'm not entirely sure if the headers the new version of the plugin uses were available before, but it sounded like they were. NR wasn't aware that the one they were using didn't report the queueing time before the dynos and Heroku now helped them to fix that.
So yeah, probably Heroku fixed their part and made sure NewRelic reflected that.
Especially the remarks about the costs and alleged fails of NewRelic seem totally wrong to me.
As a very happy NewRelic customer, I can say they did exactly what they advertise: Help monitor the application performance in the server (!). The queueing that now seems to be a problem of Heroku doesn't happen in the server that is processing the request, so by default it can't show the time needed.
Actually, I'm quite sure that using one part of NewRelic, RUM (real user monitoring), should have shown the problem quite obviously. It shows how long a user had to wait for the request to complete, including DNS lookup and network time. So if users waited longer for answers to their requests than the backend time would indicate, every developer should have taken this as a hint to investigate further.
Well, even using the application should have been enough to know that something is wrong when NR reports 250ms backend time but pages need at least 1200ms to return the first byte to the customer.
Errrm, unless I'm entirely mistaken the problem is that the queueing does happen in the server that's processing the request, and New Relic just doesn't report it.
Well, as I understood it, the queueing happens in some kind of load balancer that is responsible for routing the requests between different servers ("dynos" in Heroku speak) to handle the requests. New Relic is a plugin for your server and hooks into Apache and PHP (or in this case Rails) to learn how long it takes to process the requests (and retrieve data from the database and/or a cache). This means, to me, that queueing is strictly out of the scope of what NewRelic normally does.
However it's great that they and Heroku now found a way to report the correct queueing time. As far as I understood it, they use a special header added by Heroku to calculate the time themselves and report it in their dashboard.
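Roughly, the calculation would look something like the sketch below. The thread doesn't name the header, so the name `X-Request-Start` and its format (Unix time in milliseconds, stamped when the router receives the request) are assumptions here, as is the `queue_time_ms` helper itself. This is not NewRelic's actual code.

```python
import time
from typing import Mapping, Optional


def queue_time_ms(headers: Mapping[str, str],
                  now_ms: Optional[float] = None,
                  header: str = "X-Request-Start") -> Optional[float]:
    """Milliseconds between the router's timestamp and the app picking up
    the request, or None if the header is missing or unparseable.

    Assumes the header holds a Unix timestamp in milliseconds.
    """
    raw = headers.get(header)
    if raw is None:
        return None
    try:
        entered_router_ms = float(raw)
    except ValueError:
        return None
    if now_ms is None:
        now_ms = time.time() * 1000.0
    # Clamp at zero: clock skew between router and dyno can make this negative.
    return max(0.0, now_ms - entered_router_ms)


# The app starts handling the request 250 ms after the router stamped it.
print(queue_time_ms({"X-Request-Start": "1000"}, now_ms=1250))
```

The key point is that the measurement has to start from a timestamp taken outside the dyno; anything measured only inside the app process can't see time spent waiting in front of it.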
I would be pretty pissed if I had sunk tens of thousands of dollars and countless hours chasing ghosts. If you're a startup every dollar and every hour lost is especially costly. If Rap Genius ends up going under from running out of money it's impossible to say that this Heroku nonsense isn't at least partially to blame. If Heroku didn't give them the run-around they would have jumped onto EC2 and this problem and the costs it caused would have been completely avoided.
Tom is from New York. He's also a founder. Both roles typically imply/require a certain amount of dickishness.
Thing is, somebody had to take Heroku to task over this, and until they fix the problem somebody has to keep taking them to task.
I worked in the office beside Tom's for a year (pre-Rap Genius). He's a sharp guy. More importantly, he's right. I don't think being nice has any relevance to Rap Genius' bottom line.
If I have a problem with my ISP or my car dealer, nobody bats an eye at a blog post or an online complaint about the service.
But if I'm a customer of an admired former startup to whom I pay hundreds of thousands of dollars a month, I'm not allowed to go public with my complaints when I--and maybe hundreds of others--have been deceived and have suffered intentionally worse service than what I was promised?
I find the "enforced positiveness/optimism" of the startup community very disheartening. The essence of engineering is honesty (preferably quantified) about capabilities and limitations of systems. In this case, a former startup owned by a public company deceived their customers and then papered over (my impression) a valid, quantitatively-documented customer complaint once it became public.
Tom should be commended for speaking out. If he's right, dozens of startups have spent far more of their precious and limited capital on excess dynos and monitoring tools that could have been better spent elsewhere. I can't imagine a better service to the startup community than making this sort of thing public.
We're just very direct. There's not much value placed on unnecessary politeness or platitudes. The biggest social faux pas to a New Yorker is wasting our time.
Some people see that as us being rude and others actually appreciate it.