by Jare
4868 days ago
I'm still confused: since you're using Node.js, I imagine your dynos are effectively handling a large number of concurrent requests. This should in turn negate any impact of long-running requests, since they don't cause further requests to be queued in any way. So, where are requests (or responses) being queued (or lost) in your app? In Rabbit? Are you getting errors and slowness in streaks rather than as isolated random events? Could it be due to the spinup time of new backend workers, or something along those lines?
The way my application worked, we had 'long'-running things like saving data to a database happening before we returned a response to the client (in this case an iPhone app). Sometimes, because of Mongo, the dyno, networking, the phase of the moon, talking to the Facebook API, etc., processing would be 'slow' and it would take a few seconds for a response to reach the client. As soon as this happened on a heavily loaded system, the Heroku router would get backed up (since it only routes to 2-3 dynos at a time) and would start throwing H12 errors.
So we rewrote the entire app to do minimal data processing in the web tier and send the response back to the client as quickly as possible. At the same time, we send out a Rabbit queue message with all the instructions needed to process the data 'offline' in a worker task. There is no spinup, since these workers are running all the time. We even have several groups of workers, split by message type, so that we can segregate the work across multiple groups of worker dynos. This also lets us easily scale to more than 100 dynos to process messages. It works great. Rabbit is a godsend.
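A rough sketch of this pattern, not the actual app: the web handler only enqueues the work and responds right away, while always-running workers, grouped by message type, do the slow part. Python's in-memory `queue.Queue` stands in for RabbitMQ here, and the message types (`save_profile`, `facebook_sync`) and function names are made up for illustration.

```python
import queue
import threading
import time

# In-memory stand-ins for the RabbitMQ queues, one per message type.
# The message type names are hypothetical.
QUEUES = {"save_profile": queue.Queue(), "facebook_sync": queue.Queue()}

processed = []  # finished work, recorded so the effect is observable
processed_lock = threading.Lock()


def handle_request(message_type, payload):
    """Web tier: hand the slow work to a queue and respond immediately."""
    QUEUES[message_type].put(payload)
    return {"status": "accepted"}


def worker(message_type):
    """Worker: runs all the time, pulling messages of one type only."""
    q = QUEUES[message_type]
    while True:
        payload = q.get()
        time.sleep(0.01)  # stands in for the slow database or API call
        with processed_lock:
            processed.append((message_type, payload))
        q.task_done()


def start_workers(per_type=2):
    """Start a separate group of worker threads for each message type."""
    for message_type in QUEUES:
        for _ in range(per_type):
            threading.Thread(
                target=worker, args=(message_type,), daemon=True
            ).start()


def drain():
    """Block until every queued message has been processed."""
    for q in QUEUES.values():
        q.join()


if __name__ == "__main__":
    start_workers()
    print(handle_request("save_profile", {"user": 1}))  # returns at once
    drain()
    print(processed)
```

With a real broker the workers would be separate processes (worker dynos) consuming from named queues, so each group can be scaled independently of the web tier.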
I say 'long' and 'slow' above because the longest we should ever be taking is a couple of seconds. Unfortunately, the way the Heroku router is designed is fundamentally broken. As soon as a lot of 'slow' requests go to the same dynos, they start to stack up and the router just starts returning H12 errors. It doesn't matter how many dynos you have, because the router only talks to 2-3 dynos at a time. We get H12s with 50, 100, 200, 300, etc. dynos.
We also saw very strange behavior with the dynos. We use nodetime to log how long things take, and we'd see Redis/Mongo take only a few ms, yet the request as a whole would take >15s to complete... somewhere things are slow and we can't figure out where. Until this whole mess came out, Heroku just pointed fingers at everyone but themselves.
Oh, by the way, as soon as you get to around 200-300 dynos, deploys start erroring out as well, because Heroku can't start up enough dynos fast enough and the whole process times out too. You can't tell whether a deploy worked or not. They didn't seem to care about that at all either.
Anyway, I could keep going... but once again, I'll repeat that I'm glad that the Rapgenius guys are calling Heroku out in public on this stuff. There are some big issues here that need to be addressed, and the H12/router stuff is the big one. I'm looking forward to seeing how they pull out of this one.