Hacker News new | ask | show | jobs
by peterwaller 3997 days ago
There are two other solutions that spring to mind, which might require quite a bit less code:

1) Take the original code, do the upload exactly in place in the original request (not even spawning a goroutine). However: protect the upload with a semaphore which only allows N-in-flight.

My reasoning is, well, if the system operates with low latency when operating nominally, blocking the incoming request isn't too painful. The reason there was a problem in the first place that there were too many requests in flight and the system hit a meta-stable state where no requests could complete efficiently.

2) (or instead of (1)): If you're going to have a worker pool, why have that complicated chan-chan-Job business? It seems that `func StartProcessor` was close to being a viable solution. All you need is to start a few of those in parallel, each reading from the same `Queue`. Was there a reason to introduce the `WorkerPool chan chan Job`? That looks quite a bit more complicated than it needs to be. The queues don't need to be separate per worker unless there is some other substantial reason.

--

The next thing one would need to take care of is to ensure that the whole system doesn't stall due to a broken/laggy network, so, to put some timeouts on the S3 uploads, for example, to ensure the system can return to a stable state on its own when the thundering herd has passed.

3 comments

Re: Semaphore: Not dropping the connection might mean we might starve others incoming msgs of resources (which might be a good thing) [0]. Also, releasing the semaphore back into the pool in case of failures takes on a happy complication of having to deal with errors beyond one's control.

Re: Queue: Wouldn't the queue involve locking lest two workers end up trying to work on the same request? To be completely concurrent, I guess one could use a lock-free data structure instead (or implement one on top of something like RocksDB)?

[0] http://ferd.ca/queues-don-t-fix-overload.html

[0] http://engineering.voxer.com/2013/09/16/backpressure-in-node...

Semaphore, I'm suggesting:

  Block on N-Semaphore, with timeout
  Do timeout upload
  Replace N-Semaphore
If you don't always replace the semaphore, that's a bug.

Queue: I'm just comparing to what the article does. It already has contention on a queue (the chan), it's just the chan-chan-Worker rather than chan-Job. In practice, go channels happily handle millions of messages contending to multiple workers just fine. Consider this test example where you aren't even actually burning any CPU to perform the work:

    package main
    
    func main() {
        q := make(chan int)    
        for i := 0; i < 10; i++ {    
                go func() {    
                        for x := range q {    
                                x = x * 10    
                        }    
                }()    
        }    
    
        for i := 0; i < 1000000; i++ {    
                q <- i    
        }    
    }
On my laptop, it runs in 0.333s single core, and it's slightly slower when you set GOMAXPROCS > 1. But not much slower, the total runtime goes to 0.4-0.5s or so. (Measured with go 1.4). As soon as you do any actual work with the messages you are passing around, the overhead of locking will be lost in the noise.
Yeah I thought the double chan was a bit strange. It's like mutliple lines at checkout instead of one big line, it just increases the variance.
Timeout/Retry/ExponentialBackOff