by dkhenry 4850 days ago
This is the tech world equivalent of tabloids. Please don't promote this mindless back and forth. If you have a problem with Heroku, leave and go to one of the other providers. If you don't, stay and push them to fix this problem. Either way, stop pretending this is some huge event that we must mindlessly obsess over.

Indeed, especially considering it's painfully obvious that the problem isn't on Heroku's side but rather on their app's dismal performance. You should be able to easily do a couple of dozen requests per second; this is the kind of performance we're getting out of a single Heroku dyno on a dynamic page with no caching:

  $ ab -n 1000 -c 20 https://*****-staging.herokuapp.com/**********
  This is ApacheBench, Version 2.3 <$Revision: 655654 $>
  Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
  Licensed to The Apache Software Foundation, http://www.apache.org/
  
  Benchmarking *****-staging.herokuapp.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Completed 600 requests
  Completed 700 requests
  Completed 800 requests
  Completed 900 requests
  Completed 1000 requests
  Finished 1000 requests
  
  
  Server Software:        
  Server Hostname:        *****-staging.herokuapp.com
  Server Port:            443
  SSL/TLS Protocol:       TLSv1/SSLv3,AES256-SHA,2048,256
  
  Document Path:          /**********
  Document Length:        9670 bytes
  
  Concurrency Level:      20
  Time taken for tests:   7.130 seconds
  Complete requests:      1000
  Failed requests:        0
  Write errors:           0
  Total transferred:      10034000 bytes
  HTML transferred:       9670000 bytes
  Requests per second:    140.25 [#/sec] (mean)
  Time per request:       142.606 [ms] (mean)
  Time per request:       7.130 [ms] (mean, across all concurrent requests)
  Transfer rate:          1374.25 [Kbytes/sec] received
  
  Connection Times (ms)
                min  mean[+/-sd] median   max
  Connect:       55   59  31.7     58    1057
  Processing:    37   82  43.8     66     308
  Waiting:       35   74  42.7     57     298
  Total:         92  141  53.4    124    1096
  
  Percentage of the requests served within a certain time (ms)
    50%    124
    66%    138
    75%    153
    80%    166
    90%    199
    95%    239
    98%    282
    99%    301
   100%   1096 (longest request)
Edit: formatting.
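
For anyone reading the ab summary: the headline figures all follow from three of the reported values (request count, wall time, concurrency). A minimal sketch that recomputes them, with the inputs copied from the output above; the variable names are mine, not ab's:

```python
# Inputs copied from the ab output above.
complete_requests = 1000
time_taken_s = 7.130
concurrency = 20
total_bytes = 10034000

# Throughput: completed requests divided by wall-clock time.
requests_per_second = complete_requests / time_taken_s

# Mean latency seen by one client: each of the 20 concurrent clients
# waits roughly concurrency times the per-request share of wall time.
time_per_request_ms = concurrency * time_taken_s / complete_requests * 1000

# Mean time per request across all concurrent clients (wall time share).
time_per_request_across_ms = time_taken_s / complete_requests * 1000

# Transfer rate in Kbytes/sec (1 Kbyte = 1024 bytes).
transfer_rate_kb_s = total_bytes / 1024 / time_taken_s

print(round(requests_per_second, 2))
print(round(time_per_request_ms, 1))
print(round(time_per_request_across_ms, 2))
print(round(transfer_rate_kb_s, 1))
```

The small differences from the printed figures come from the wall time being rounded to three decimals in the report.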
> this is the kind of performance we're getting out of a single Heroku dyno on a dynamic page with no caching

If you read the original article[0], you would know that this is a problem that only affects apps with a large number of dynos.

I have not done queuing theory in a long time, but my initial sense is that the math on this one will be a generalization of the birthday problem [1], which is Wiki-notable on the sole basis that the probability of sharing a birthday (or, in our case, the probability of queueing a request) is far, far higher than ordinary people anticipate for N above 23. Assuming I've captured the essence of the problem correctly, you would see a sharp drop in performance when you start to saturate about 20-30 dynos.
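
To make the birthday-problem intuition concrete, here is a rough sketch. It assumes requests are routed uniformly at random and independently, and it only asks whether any two in-flight requests land on the same dyno; it is not a full queueing model. The function name and the dyno counts are made up for illustration.

```python
def collision_probability(num_slots, num_items):
    """Probability that at least two of num_items uniform random picks
    out of num_slots equally likely slots land on the same slot."""
    if num_items > num_slots:
        return 1.0  # pigeonhole: a shared slot is guaranteed
    p_all_distinct = 1.0
    for i in range(num_items):
        p_all_distinct *= (num_slots - i) / num_slots
    return 1.0 - p_all_distinct


# Classic birthday problem: 23 people, 365 days, just over 50%.
print(round(collision_probability(365, 23), 3))

# Hypothetical routing case: 10 simultaneous requests over 50 dynos.
print(collision_probability(50, 10))
```

Even with five times as many dynos as in-flight requests, the chance that some dyno gets two of them is well above even odds, which is the nonintuitive part.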

Given that there's an entire Wikipedia article on the sole basis that the behavior of these mathematical functions is nonintuitive, I think it is pretty fair to give RapGenius a pass at being surprised by the math as well.

[0] http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics

[1] http://en.wikipedia.org/wiki/Birthday_problem

I did read the original article. The problem is that their stack is not concurrent.

In a non-concurrent web application stack (like Rails), one request is processed at a time and further requests to the same node are queued. This means that if some request takes five seconds to answer, everybody that is queued on that node after that long request has to wait until the first request is fulfilled. That's the behavior they're seeing.

In a multithreaded or reactive web stack, other requests will get processed alongside the long request and, guess what, the problem doesn't happen unless all worker threads are processing long requests because the short requests will get processed alongside the long one by the other workers.

Assuming your stack has, say, 20 worker threads, the probability of your random load balancer overloading your node with 60 long requests given a large enough pool is small, assuming long requests are a small fraction of your load. If your concurrency level is 1, the probability of your node getting overwhelmed by long requests is much higher.
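
Here is a small simulation of that argument, as a sketch under stated assumptions rather than a model of Heroku's actual router. Everything in it is hypothetical: Poisson arrivals at 10 requests/s, 5% of requests taking 5 s and the rest 50 ms, uniform random routing, and FIFO queues per node. It compares 40 nodes with one worker each against 10 nodes with four workers each, so total capacity is the same.

```python
import heapq
import random


def mean_wait(num_nodes, workers_per_node, num_requests=20000,
              arrival_rate=10.0, p_long=0.05, short_s=0.05, long_s=5.0,
              seed=0):
    """Mean queueing delay (seconds) under uniform random routing.

    Each node keeps a min-heap of the times at which its workers become
    free; an arriving request takes the worker that frees up first (FIFO).
    """
    rng = random.Random(seed)
    free_at = [[0.0] * workers_per_node for _ in range(num_nodes)]
    now = 0.0
    total_wait = 0.0
    for _ in range(num_requests):
        now += rng.expovariate(arrival_rate)       # next arrival time
        node = free_at[rng.randrange(num_nodes)]   # random load balancer
        service = long_s if rng.random() < p_long else short_s
        earliest_free = heapq.heappop(node)
        start = max(now, earliest_free)            # wait if all workers busy
        total_wait += start - now
        heapq.heappush(node, start + service)
    return total_wait / num_requests


wait_single = mean_wait(num_nodes=40, workers_per_node=1)
wait_multi = mean_wait(num_nodes=10, workers_per_node=4)
print(f"1 worker/node:  {wait_single:.4f} s mean wait")
print(f"4 workers/node: {wait_multi:.4f} s mean wait")
```

With the same number of workers overall, pooling them behind fewer nodes means a short request only queues when every worker on its node is stuck, which is the point being made above.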

You can see it this way: if you have a stack that can only process one request at a time, the probability of that one single request processor getting backlogged is like getting three heads in a row on a non-biased coin. If you have twenty request processors, the probability of that node being backlogged is like getting three heads in a row for all twenty processors. Much less likely to happen.
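
Putting numbers on the coin analogy, assuming the twenty processors are independent (a simplification that real traffic won't strictly satisfy):

```python
from fractions import Fraction

# One processor backlogged: three heads in a row on a fair coin.
p_one = Fraction(1, 2) ** 3

# All twenty processors backlogged at once, assuming independence.
p_all_twenty = p_one ** 20

print(p_one)                 # 1/8
print(float(p_all_twenty))   # 2**-60, on the order of 1e-18
```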

They were told to run Unicorn, which from my understanding just forks the Ruby interpreter a couple of times to run in parallel. They decided not to (or were unable to).

They decided instead to whine about the problem and ask Heroku to build some magic load balancer that would solve all their problems. Even if they did have a load balancer that did least-conns, all of Heroku's traffic does not go through a single load balancer, meaning that separate load balancers could, through bad luck, allocate their requests to the same unfortunate node. [1]

What they did is amateurish: instead of looking at the problem and fixing it by either multithreading their code or switching away from RoR, they blamed their vendors, just like beginning programmers blame their bugs on the compiler or the libraries they use. When Twitter needed to scale, they moved some of their stuff away from Rails to Scala; Facebook wrote HipHop for PHP, their PHP-to-C++ transpiler; and so on.

Was Heroku completely in the clear? No. Their documentation was misleading and I believe they've admitted that. Was it a problem that New Relic didn't show all the metrics needed to isolate the performance issue? Yes.

We'll see how this whole story unfolds, but from my perspective, the more of a stink RapGenius raises, the more amateurish they look.

[1] http://aphyr.com/posts/277-timelike-a-network-simulator