Handling 1M Requests per Minute with Go (marcio.io)
145 points by mcastilho 3996 days ago
20 comments

There are two other solutions that spring to mind, which might require quite a bit less code:

1) Take the original code, do the upload exactly in place in the original request (not even spawning a goroutine). However: protect the upload with a semaphore which only allows N-in-flight.

My reasoning is, well, if the system operates with low latency when operating nominally, blocking the incoming request isn't too painful. The reason there was a problem in the first place is that there were too many requests in flight and the system hit a meta-stable state where no requests could complete efficiently.

2) (or instead of (1)): If you're going to have a worker pool, why have that complicated chan-chan-Job business? It seems that `func StartProcessor` was close to being a viable solution. All you need is to start a few of those in parallel, each reading from the same `Queue`. Was there a reason to introduce the `WorkerPool chan chan Job`? That looks quite a bit more complicated than it needs to be. The queues don't need to be separate per worker unless there is some other substantial reason.

--

The next thing one would need to take care of is ensuring that the whole system doesn't stall due to a broken/laggy network: for example, by putting timeouts on the S3 uploads so that the system can return to a stable state on its own when the thundering herd has passed.

Re: Semaphore: Not dropping the connection might mean we starve other incoming messages of resources (which might be a good thing) [0]. Also, releasing the semaphore back into the pool in case of failures takes on a happy complication of having to deal with errors beyond one's control.

Re: Queue: Wouldn't the queue involve locking lest two workers end up trying to work on the same request? To be completely concurrent, I guess one could use a lock-free data structure instead (or implement one on top of something like RocksDB)?

[0] http://ferd.ca/queues-don-t-fix-overload.html

[0] http://engineering.voxer.com/2013/09/16/backpressure-in-node...

Semaphore, I'm suggesting:

  Block on N-Semaphore, with timeout
  Do timeout upload
  Replace N-Semaphore
If you don't always replace the semaphore, that's a bug.
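
A minimal Go sketch of those three steps, assuming a buffered channel as the N-semaphore. The names `Limiter`, `ErrBusy` and `demo` are made up for illustration and are not from the article; the upload itself is a stand-in function.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrBusy is returned when no slot frees up before the acquire timeout.
var ErrBusy = errors.New("too many uploads in flight")

// Limiter is a counting semaphore built on a buffered channel (hypothetical type).
type Limiter struct {
	slots          chan struct{}
	acquireTimeout time.Duration
	uploadTimeout  time.Duration
}

func NewLimiter(n int, acquireTimeout, uploadTimeout time.Duration) *Limiter {
	return &Limiter{
		slots:          make(chan struct{}, n),
		acquireTimeout: acquireTimeout,
		uploadTimeout:  uploadTimeout,
	}
}

// Do blocks on the semaphore (with timeout), runs the upload under its own
// deadline, and always puts the slot back, even when the upload fails.
func (l *Limiter) Do(ctx context.Context, upload func(context.Context) error) error {
	timer := time.NewTimer(l.acquireTimeout)
	defer timer.Stop()
	select {
	case l.slots <- struct{}{}: // slot acquired
	case <-timer.C:
		return ErrBusy
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-l.slots }() // replace the slot no matter what

	uploadCtx, cancel := context.WithTimeout(ctx, l.uploadTimeout)
	defer cancel()
	return upload(uploadCtx)
}

// demo holds the only slot with one upload and shows a second one timing out.
func demo() (okErr, busyErr error) {
	l := NewLimiter(1, 20*time.Millisecond, time.Second)
	ctx := context.Background()
	started, release := make(chan struct{}), make(chan struct{})
	done := make(chan error, 1)
	go func() {
		done <- l.Do(ctx, func(context.Context) error {
			close(started)
			<-release // pretend the upload is slow
			return nil
		})
	}()
	<-started
	busyErr = l.Do(ctx, func(context.Context) error { return nil })
	close(release)
	okErr = <-done
	return okErr, busyErr
}

func main() {
	okErr, busyErr := demo()
	fmt.Println("first upload:", okErr)
	fmt.Println("second upload:", busyErr)
}
```

The deferred receive is what makes "always replace the semaphore" hard to get wrong: it runs on every return path once the slot is held.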

Queue: I'm just comparing to what the article does. It already has contention on a queue (the chan), it's just the chan-chan-Worker rather than chan-Job. In practice, Go channels happily handle millions of messages with multiple workers contending for them. Consider this test example where you aren't even actually burning any CPU to perform the work:

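    package main

    func main() {
        // One unbuffered channel shared by all workers.
        q := make(chan int)
        for i := 0; i < 10; i++ {
            go func() {
                // Every worker receives from the same channel; each value
                // sent is delivered to exactly one of them.
                for x := range q {
                    x = x * 10
                }
            }()
        }

        // Push a million messages through the shared channel.
        for i := 0; i < 1000000; i++ {
            q <- i
        }
    }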
On my laptop, it runs in 0.333s on a single core, and it's slightly slower when you set GOMAXPROCS > 1, but not by much: the total runtime goes to 0.4-0.5s or so (measured with Go 1.4). As soon as you do any actual work with the messages you are passing around, the overhead of locking will be lost in the noise.
Yeah I thought the double chan was a bit strange. It's like multiple lines at checkout instead of one big line, it just increases the variance.
Timeout/Retry/ExponentialBackOff
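
To spell that out, here is a rough Go sketch combining a per-attempt timeout, a bounded number of retries, and a delay that doubles after each failure. `Retry` and `flaky` are hypothetical helpers written for this example, and the parameters (attempt count, base delay) are arbitrary assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Retry calls op up to attempts times. Each call gets its own perTry deadline,
// and failures are followed by sleeps of base, 2*base, 4*base, ...
func Retry(ctx context.Context, attempts int, base, perTry time.Duration,
	op func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		tryCtx, cancel := context.WithTimeout(ctx, perTry)
		err = op(tryCtx)
		cancel()
		if err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2 // exponential backoff
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// flaky runs Retry against an operation that fails the first `failures`
// times, and reports how many calls were made.
func flaky(failures, attempts int) (calls int, err error) {
	err = Retry(context.Background(), attempts, time.Millisecond, time.Second,
		func(context.Context) error {
			calls++
			if calls <= failures {
				return errors.New("transient failure")
			}
			return nil
		})
	return calls, err
}

func main() {
	calls, err := flaky(2, 5)
	fmt.Println("calls:", calls, "err:", err)
	calls, err = flaky(10, 3)
	fmt.Println("calls:", calls, "err:", err)
}
```

A production version would likely add jitter to the delay so that many clients don't retry in lockstep.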
I was confused by the numbers at first, so: that's 17k requests per second, spread out over 4 dual core Xeon (Haswell) machines, which works out to just over 4000 requests/s per machine. It's still a respectable number, but it's much closer to what one would expect given the task.

Don't get me wrong, the most interesting part is definitely the implementation and as a Go noob I found it very useful - it's just a bit misleading for the headline to sum your request rate across all parallelized machines.

>> But since the beginning, our team knew that we should do this in Go because during the discussion phases we saw this could be potentially a very large traffic system

I don't get this reasoning.

They're unfamiliar with network programming, and so picked a language that seems fast to them, is my assessment.

Based on their description, they should be IO bound. If they're IO bound, their system should spend the vast majority of time in the kernel executing system calls.

If they do that, then whichever language they use has very little relevance to whether or not their system ends up being fast.

EDIT: As an example, my first ever production Ruby app was a messaging middleware server that processed about 3 million messages a day (34/sec) on 10% of a single mid-range 2005-era Xeon core. It was totally unoptimized, and used select() instead of the more modern, faster alternatives. Also, only about 10% of that time was spent in user space. 90% of it was in the kernel, executing system calls, so most meaningful optimization would be in improving the usage of system calls that are exactly the same across languages.

At the time, going beyond a single core required multiple instances, as Ruby lacked native threads back then. But processing 500-600/sec across 2-3 processes on a dual core 2005-era Xeon would not have been challenging even then with that quite naive and dated approach.

As someone else pointed out, they're doing ~4k/sec on a modern dual-core Xeon. Based on the above I'd expect it to be fairly easy to match their ~4K/sec with Ruby today.

It's less about Go being incredibly fast than about Ruby being anomalously slow. It makes sense if you start from the presumption that Go is the only "fast" language they're considering.
Frankly, if you're going to be processing requests and shuffling them to S3, Ruby is nowhere near being the limiting factor unless you do something pathologically bad.

Having spent time recently optimizing a setup that does pretty much exactly what they're doing (in Rails; I don't like Rails, but in this case Rails is not a problem), I found that the time savings came from things like eliminating unnecessary buffering all over the place (e.g. POSTs get buffered by pretty much everything that likes to consider itself a web server, sometimes multiple times; if you e.g. run Nginx in front of pretty much any Ruby web server running Rails, worst case you may end up with things passed through at least 3, possibly 4 buffers).

Basically they should be IO bound.

But that would prevent the knee-jerk tribal language war response.
They're going from Ruby, so Go is a pretty easy transition with respect to skill set and helped them accomplish scale (as seen from the analysis).

I agree that other languages could accomplish the same effect, but I'm not surprised by their sentiment given where they were coming from.

Go is 1-2 orders of magnitude faster than Ruby, and much easier to write concurrent code in.

It makes perfect sense to me. What would you have recommended to them, for a reasonably-high-performance server implementation? (Please don't say C.)

https://benchmarksgame.alioth.debian.org/u64q/benchmark.php?...

>>> It makes perfect sense to me

Yes. Other than some very basic scripts, I have never worked with Ruby, so I was not aware that it is so slow. Thanks for pointing that out.

As for my recommendation, it is pretty standard worker architecture:

Their system seems to be an ingesting-only system, that is, the clients are getting an empty HTTP 200 OK response. Given this, I would put openresty (nginx) in the front, with some trivial Lua code[1] to en-queue payloads to beanstalkd. Then, you can either have your workers inside openresty (using Openresty timers) or have them as separate processes and written in the language of choice. We have been using this for a couple of years now and it is working really well for our use case, also an ingesting-only system.

[1] https://github.com/smallfish/lua-resty-beanstalkd/blob/maste...

we use the same technology with KONG [1]

[1] https://github.com/mashape/kong

Scala, Java, Clojure, C#, F#, Node, LuaJIT, Erlang, Elixir, Haskell, D, Nim... There are lots of reasonably fast options out there that aren't C.
> Go is 1-2 orders of magnitude faster than Ruby

That may be true for some types of code, but for an app that's predominantly shuffling data over the network you should be spending most of the time in kernel space executing syscalls, and then language differences are largely irrelevant.

> and much easier to write concurrent code in.

How? Writing concurrent code in Ruby is trivial since 1.9.x (prior to 1.9 you had to battle the green threads in MRI for some stuff), which isn't exactly new.

> Go is 1-2 orders of magnitude faster than Ruby, and much easier to write concurrent code in.

Than MRI Ruby you mean.

The advantage wouldn't be as much if Ruby designers cared to add AOT compilation in the same vein as Dylan or Common Lisp to the canonical implementation.

That's exactly the space that Crystal [0] seems to be exploring. Statically type checked and pre-compiled via LLVM. So it isn't the Ruby designers themselves, but definitely Ruby flavored.

0: http://crystal-lang.org/

> > Go is 1-2 orders of magnitude faster than Ruby, and much easier to write concurrent code in.

> Than MRI Ruby you mean.

Given the nature of orders-of-magnitude comparisons and the lack of Ruby implementations that are even one order of magnitude faster than MRI, "...than Ruby" is reasonably accurate if "...than MRI Ruby" is at all accurate.

> The advantage wouldn't be as much if Ruby designers cared to add AOT compilation in the same vein as Dylan or Common Lisp to the canonical implementation.

Maybe, though that's unproven. AFAIK, actual Ruby implementations with AOT only seem to gain about a factor of 2 improvement, not an order of magnitude.

I was always a big fan of Erlang's concurrency model but I've really been getting into Elixir lately. It is just such a productive language to write and lends itself to extremely maintainable codebases.
There's nothing concurrent in this case, parallel sure, but concurrency wise this is all shared nothing. And languages are not faster than one another. They are faster at certain tasks but the variability for a given task is really case specific. For io heavy workloads like this underlying libraries and implementations dominate execution time.

You could have written this in any number of languages. I don't know why Go is more logical than, say, Java or JavaScript.

Lua would beat go every day of the week.
[citation needed]
As Clojure would beat Lua every second of a day.
I'm a big clojure fan and not the biggest Lua fan, but I'm afraid you are mistaken. You won't find any dynamically typed language implementation that can come close to LuaJIT for the general use case.
What if you type hint everything in Clojure properly so all reflection is avoided? (does take some time, but with something like YourKit it'll be only a few hours max on a mid-size project)
If you're talking about raw performance, I'm afraid you're mistaken. If you're perhaps referring to the 'aesthetics' of the language syntax, that's pretty subjective, in which case you're right for your tastes.
show me the clojure firewall and I'll take you seriously. Or even something as simple as an interesting clojure ML project. Java gross.
Arguably, UploadToS3 should not be a method of *Payload.

Suggestion: Make an S3-uploader package with _internal_ connection pooling, upload queueing and concurrency handling.

This. Each S3-uploader can then hold a http.Client instance that keeps the connection open to S3 between uploads.
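
Roughly what that could look like in Go. This is only a sketch: `Uploader`, `NewUploader`, `Put` and `demo` are invented names, the endpoint is a local test server, and real S3 requests would also need authentication/signing, which is left out entirely.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Uploader owns a single http.Client so keep-alive connections are reused
// across uploads instead of being dialed again for every request.
type Uploader struct {
	client  *http.Client
	baseURL string // hypothetical endpoint, not a real S3 URL
}

func NewUploader(baseURL string) *Uploader {
	return &Uploader{
		baseURL: baseURL,
		client: &http.Client{
			Timeout:   10 * time.Second,
			Transport: &http.Transport{MaxIdleConnsPerHost: 16},
		},
	}
}

// Put uploads body under key with a plain HTTP PUT.
func (u *Uploader) Put(key string, body []byte) error {
	req, err := http.NewRequest(http.MethodPut, u.baseURL+"/"+key, bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := u.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the connection can return to the idle pool.
	io.Copy(io.Discard, resp.Body)
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("upload %s: unexpected status %s", key, resp.Status)
	}
	return nil
}

// demo uploads n objects to a local test server and counts how many TCP
// connections the server saw.
func demo(n int) (uploads, conns int) {
	var newConns int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.Copy(io.Discard, r.Body)
			w.WriteHeader(http.StatusOK)
		}))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&newConns, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	u := NewUploader(srv.URL)
	for i := 0; i < n; i++ {
		if err := u.Put(fmt.Sprintf("obj-%d", i), []byte("payload")); err == nil {
			uploads++
		}
	}
	return uploads, int(atomic.LoadInt64(&newConns))
}

func main() {
	uploads, conns := demo(5)
	fmt.Println("uploads:", uploads, "connections opened:", conns)
}
```

The point of the connection counter is just to make the reuse visible: with a shared client, sequential uploads should not each open a new connection.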
I haven't used this library, but this appears to be what you want. https://github.com/rlmcpherson/s3gof3r

Nice features include:

> Retry Everything: All http requests and every part is retried on both uploads and downloads.

> Configurable concurrency

> Uses an io.Writer (you could actually start posting to S3 before all of the data gets in on your side.)

etc.

Does anybody here have any experience with Go's garbage collection pauses with large heap sizes? I've got a Scala app that regularly consumes about 48G of RAM, and I'm very happy with the response times during heavy loads like this, but the P99.5 is abysmal because of garbage collection. I've tried tuning it, but it doesn't seem like anything I do helps. I'll probably end up using an Azul JVM but I'm curious how other languages end up handling this problem.
this article discusses how they handled 69GB of heap and up to 6 seconds of pause times:

http://blog.golang.org/qihoo

the new garbage collector in 1.5 should improve things.

Heap sizes 32gb+ can't use compressed pointers so don't forget about that bit of overhead. 48gb is just in that annoying zone where you need to go above 48gb heap to get real benefits of 32gb+ effective heap.

But I only maintain EE apps on at most 10gb heaps, so I'm not hugely experienced with tuning this. All I can recommend is heap size and CMS thresholds (no G1 experience) set so that at steady state you don't end up hitting full GCs.

For a JVM heap size greater than 32GB you should be using G1GC.

Tuning GC for Spark: https://www.youtube.com/watch?v=drmJDISLkf4

In the Big Data space we have dozens of machines all with very large heap sizes (I run mine with 250GB) and don't run into any major stop-the-world pauses.

> I've got a scala app that regularly consumes about 48G of ram, and I'm very happy with the response times during heavy loads like this, but the P99.5 is abysmal because of garbage collection. I've tried tuning it, but it doesn't seem like anything I do helps.

When tuning a prod Scala deployment a while back, I encountered a nasty "pregnant pause" every once in a while similar to what you have experienced. So, below is what I used, in annotated form and adapted for the deployment you describe (the max heap setting). Some of these may be completely obvious to you (or others reading), yet are included for completeness.

  -server
  This one should be obvious :-)
  
  -Xmx54G
  Memory is cheap, so give the JVM 54 gig so that GC
  isn't forced to run when your system is in the
  steady-state of 48G heap utilization.
  
  -XX:PermSize=128m -XX:MaxPermSize=1024M
  Ditto on the cheapness of memory.
  
  -Xss1M
  A stack size of 1 meg seems a bit much, but does
  allow for recursive algorithms to operate with
  impunity.
  
  -XX:ReservedCodeCacheSize=128m
  This one was needed for Scala.  It likely should
  be specified with a high value since it limits the
  JIT's code cache.
  
  -XX:+DoEscapeAnalysis
  A nice way to relieve some heap pressure[1].
  
  -XX:+UseCodeCacheFlushing
  Should the ReservedCodeCacheSize be exceeded, this
  lets the JIT continue to do its thing in an LRU
  type of fashion (I believe).
  
  -XX:+UseParallelGC
  This one is the most impactful one of all.  A lot
  of people will say "use UseConcMarkSweepGC!"  They
  are wrong for high volume server deployments.  The
  concurrent mark and sweep algorithm caused massive
  "pregnant pauses" in prod for me!  The Parallel GC
  algorithm performs much better under load and
  doesn't cause the VM to sit-and-spin for 20+
  seconds.
  
  -XX:+UseCondCardMark
  Another tweak which had a major performance boost
  for me[2].
  
  -XX:+UseNUMA
  If your servers are NUMA[3] based, then this can
  significantly increase performance[4] as well.
HTH

1 - http://www.ibm.com/developerworks/java/library/j-jtp09275/in...

2 - https://blogs.oracle.com/dave/entry/false_sharing_induced_by...

3 - https://en.wikipedia.org/wiki/Non-uniform_memory_access

4 - http://jose-manuel.me/2011/06/numa-bb/

EDIT: Inserted newlines to eliminate horizontal scrolling.

Have you tried the new G1 GC? I'm not very familiar with GC tuning, but I thought using the CMS GC with very large heaps is asking for trouble. Apparently the G1 GC alleviates this, and still manages to be as "cool" as the CMS GC.
> Have you tried the new G1 GC?

IIRC, I did and it didn't benchmark well for my needs. However, it all depends on what JRE you're using and what the system is doing to pick the GC which is best for any given deployment. Classic case of YMMV and all that.

In the end, even though it may sound trite, the only way to know what works best for a given combination of JRE/OS/hardware is to measure it. This article[1] had some good tips and Mission Control[2] is a huge help in this arena.

1 - http://www.infoq.com/articles/Tuning-Java-Servers?utm_source...

2 - http://docs.oracle.com/javacomponents/jmc-5-5/jmc-user-guide...

It's really more about how many objects are on the heap, less about how large the heap is.
What is P99.5?
99.5th percentile -- below that threshold, requests are generally fine, but there's a small fraction of requests that take way too long to go through due to GC kicking in.
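
As a made-up illustration of why the tail matters, here is a small Go sketch that computes percentiles with the nearest-rank method. The latency numbers are invented (992 fast requests at 5ms, 8 slow ones at 2000ms standing in for GC pauses), and `Percentile` and `sampleLatencies` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Percentile returns the p-th percentile (0 < p <= 100) using the
// nearest-rank method: the smallest sample such that at least p% of all
// samples are less than or equal to it.
func Percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...) // don't mutate the input
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// sampleLatencies builds 1000 invented latencies in milliseconds:
// 992 requests at 5ms and 8 requests that hit a 2000ms pause.
func sampleLatencies() []float64 {
	s := make([]float64, 0, 1000)
	for i := 0; i < 992; i++ {
		s = append(s, 5)
	}
	for i := 0; i < 8; i++ {
		s = append(s, 2000)
	}
	return s
}

func main() {
	s := sampleLatencies()
	for _, p := range []float64{50, 99, 99.5} {
		fmt.Printf("P%v = %vms\n", p, Percentile(s, p))
	}
}
```

With these numbers the median and P99 both look fine, and only P99.5 exposes the slow requests, which is the situation described above.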
Cool article, not necessarily because of the language specifics but because of the thought process involved. Thanks!
If this system is merely decoding JSON and writing payloads to S3 for asynchronous data processing, why not have your clients write directly to S3?
I'm not the author of the article, but if I had to guess: Data encapsulation, and interface control. If all your clients are talking directly to S3 instead of your encapsulated service interface, you can't inject any business logic at all and must design around the fact that you don't control the service interface, Amazon does.
It's interesting this came up today. I'm looking for a new language to migrate my flask/uwsgi web service to. I'm having a terrible time making it scale.

Are there any tutorials/templates/best practices for writing a small web service in Golang?

The best advice I can give is to write a minimal web server using net/http[0], get a feel for how to write a RESTful HTTP API by hand, and then start experimenting with web frameworks like Gin[1], Martini[2], and Beego[3].

[0] https://golang.org/doc/articles/wiki/

[1] https://github.com/gin-gonic/gin

[2] https://github.com/go-martini/martini

[3] https://github.com/astaxie/beego
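
For a sense of scale, a minimal net/http service can be about this small. This is a sketch only: the `/events` route, the `event` payload shape and the `probe` helper are assumptions made up for the example, and `probe` exercises the handler in memory instead of opening a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// event is a hypothetical payload shape used only for this sketch.
type event struct {
	Name string `json:"name"`
}

// ingest accepts a JSON document via POST and answers 202 Accepted.
func ingest(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.Header().Set("Allow", http.MethodPost)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var e event
	// Cap the request body at 1 MiB so a client can't exhaust memory.
	body := http.MaxBytesReader(w, r.Body, 1<<20)
	if err := json.NewDecoder(body).Decode(&e); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

// newMux wires routes using only the standard library.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/events", ingest)
	return mux
}

// probe sends one in-memory request through the mux and returns the status.
func probe(method, path, body string) int {
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// A real service would instead run:
	//   log.Fatal(http.ListenAndServe(":8080", newMux()))
	fmt.Println(probe("POST", "/events", `{"name":"boot"}`))
	fmt.Println(probe("GET", "/events", ""))
	fmt.Println(probe("POST", "/events", "not json"))
}
```

Because the handler has the standard `func(http.ResponseWriter, *http.Request)` signature, it can be moved behind a router or framework later without being rewritten.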

I'd recommend against both martini and gin. Neither properly uses the http.Handler interface, and the owner of martini has officially stated that martini is not idiomatic.

I loved it when I started, then really hated it when I needed to refactor, there is just too much magic and you pay a speed cost for that magic.

Before you rewrite in another language? Have you benchmarked it? Are you sure you know where the bottleneck is?
the only reason to write something like that in Go is if you're looking to impress a recruiter at (Go). Advice: pick a better tool for your problem.
This is terrible advice. Golang may have a flavor-of-the-week status in some people's minds, but it's a language that really deserves more credit. It is really easy to program and learn, and it does a lot of stuff other languages rely on third party programs for. With each release, nagging issues (namely around garbage collection) are getting resolved, but it's more than production ready. To ignore Golang right now would be akin to ignoring Java back in the early 2000s in my opinion.
Golang has my attention, but I don't think it's anywhere near Java, at least popularity-wise, in the early 2000s. By then most schools had already switched their language of choice to Java - I'm not aware of any that has switched theirs to Golang.

I do enjoy coding in Golang, but we use mostly Java where I work, and for us, the benefits don't make up for the things we lose. This blog post is a great example: the solution they had to find is the first thing you'd probably do in Java, because Java has a standard package with all sorts of concurrency patterns.

Yeah, I remember wanting to do a fan-out pattern in Go and reading the Go Pipelines and Cancellation article[0]. I saw the function merge() in the article and thought, "Great, here's what I need!" I then proceeded to read further and saw that I have to define this function myself based on the types I'm using, which made me quite sad.

Go really needs a library for these patterns built in... I assume the lack of generics prevents users from creating that themselves (I'm not trying to start a language war here, seriously).

[0] http://blog.golang.org/pipelines
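
For readers on newer toolchains: Go 1.18 and later have type parameters, which postdates this discussion, so a fan-in function like the one wished for above can be written once for any element type. A sketch, with `Merge`, `emit` and `sum` as illustrative names and without the cancellation ("done" channel) handling the article builds up to:

```go
package main

import (
	"fmt"
	"sync"
)

// Merge fans any number of input channels into one output channel.
// The output is closed once every input has been closed and drained.
func Merge[T any](inputs ...<-chan T) <-chan T {
	out := make(chan T)
	var wg sync.WaitGroup
	wg.Add(len(inputs))
	for _, in := range inputs {
		go func(in <-chan T) {
			defer wg.Done()
			for v := range in {
				out <- v
			}
		}(in)
	}
	// Close out only after all forwarding goroutines have finished.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// emit sends vals on a fresh channel and then closes it.
func emit[T any](vals ...T) <-chan T {
	ch := make(chan T)
	go func() {
		defer close(ch)
		for _, v := range vals {
			ch <- v
		}
	}()
	return ch
}

// sum drains a channel of ints and returns the total.
func sum(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	merged := Merge(emit(1, 2, 3), emit(10, 20), emit(100))
	fmt.Println(sum(merged))
}
```

The order in which merged values arrive is not defined, which is why the example only checks an order-independent total.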

"To ignore Golang right now would be akin to ignoring Java back in the early 2000s in my opinion."

Except Go provides limited value for somebody who already knows .net or a jvm language. And that's a huge chunk of the market.

And the key advantage that Go offers to business is that they can easily hire from a pool of experienced programmers and have them learn Go with little downtime.

Sitting at a corporation and learning Java may have been a popular thing to do to join the "boys" club in the early 2000s but I promise you that won't last forever. It's not bad advice to pick a target and actually point at it instead of wasting time solving a problem that has already been fixed.
A better tool such as?
depends a lot on what your constraints are and what goal you're trying to achieve. Go could be the right choice, but I believe the prior poster was alluding to the fact that it does not fit the sinatra/flask/etc. use case very well in most situations.
You can put up any large number of requests if you make the period in 'per {{period}}' large enough.
A little confused here. So the third solution mentioned in the post is just a worker pool? I think if we just slightly modified the second solution, it would totally work.

1) Initialize a job channel

2) Initialize a set of workers that listen to this channel and pull jobs from it indefinitely. In this case, just call go StartProcessor() a fixed number of times.

What confuses me is that IMO workPoolChannel isn't necessary here. What is the reasoning behind using a channel of workers?
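
The two steps above can be sketched like this. It is not the article's code: `Job`, `StartWorkers` and `run` are stand-in names, and the handler just counts jobs instead of uploading anything.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Job is a stand-in for the article's job type.
type Job struct{ ID int }

// StartWorkers launches n goroutines that all receive from the same jobs
// channel. The returned WaitGroup completes once jobs is closed and drained.
func StartWorkers(n int, jobs <-chan Job, handle func(Job)) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for j := range jobs { // each job goes to exactly one worker
				handle(j)
			}
		}()
	}
	return &wg
}

// run pushes n jobs through a pool and reports how many were handled and the
// sum of their IDs (to check that nothing was dropped or handled twice).
func run(workers, n int) (processed, idSum int64) {
	jobs := make(chan Job, 100) // step 1: a single job channel
	wg := StartWorkers(workers, jobs, func(j Job) { // step 2: fixed worker set
		atomic.AddInt64(&processed, 1)
		atomic.AddInt64(&idSum, int64(j.ID))
	})
	for i := 0; i < n; i++ {
		jobs <- Job{ID: i}
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return processed, idSum
}

func main() {
	processed, idSum := run(8, 1000)
	fmt.Println("processed:", processed, "id sum:", idSum)
}
```

The buffered channel doubles as the bounded queue: when it is full, senders block, which gives backpressure without a separate dispatcher.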

IMHO this solution is nice but wrong. The point is not just creating a "cool" program with Go that will handle HTTP requests.

Without really knowing the company's needs, I am relying on this paragraph from the post:

While working on a piece of our anonymous telemetry and analytics system, our goal was to be able to handle a large amount of POST requests from millions of endpoints. The web handler would receive a JSON document that may contain a collection of many payloads that needed to be written to Amazon S3, in order for our map-reduce systems to later operate on this data.

Knowing this, I would build it differently.

1. Clients post to S3 directly

2. Lambda -> Overload business logic, private data, cleanup, spam control etc...

3. Prepare files (64M) for Hadoop

4. Hadoop

There's no reason to have that proxy in the middle, Amazon S3 will handle those millions of requests with no real trouble, I wouldn't throw machines on this process.

Other than the worker/concurrency mechanism being part of the language, what's the difference to a RabbitMQ-Worker architecture? You might even argue that the lack of persistence (in the given example) is a potential source for data loss.
I'm guessing persistence is not a priority.

The difference is management of a simple process (behind elastic load balancer, of course) vs a more complicated architecture with three distinct, load balanced process types (webserver -> queue -> worker).

Is this supposed to be great performance? I think Netty does ~30K/s (1800000 req/min) out of the box. I thought Go had more out of the box performance, maybe I am missing something.
Oh, performance? Let's throw numbers around! Here, Elixir/Phoenix beats them all!

https://twitter.com/julianobs/status/614416512825323520

Hey, and Elixir is already way more expressive than Go and it's incredibly easy to build fault-tolerant and distributed systems, not to mention the productivity gains when using the phoenix framework!

Seriously, posting requests/second metric without any context about hardware and sample code doesn't help anyone.

Cool it is only 10-16x slower than Aleph. https://github.com/ptaoussanis/clojure-web-server-benchmarks...
full blown mvc framework vs communication layer, seems legit :)

but hey, as long as stuff responds in microseconds with zero errors under load, just use it ! (Also, clojure is a way better language than go, too.)

Yes I like to throw meaningless numbers around as much as the other guy. :)
Can already do this with a single thread using epoll. In fact, even a simple epoll implementation can handle 1M HTTP requests in ~30 seconds on a single thread juggling 10k connections.
Have you considered putting KONG [1], which is basically OpenResty (nginx), in the front? With LuaJIT, performance [2] is outstanding.

[1] https://github.com/mashape/kong

[2] https://github.com/mashape/kong#benchmarks

16k requests per second isn't fast.
16k requests per second is not fast if you're talking about fetching a page or doing a minimal amount of I/O.

16k requests per second is worth writing about if you're talking about a process with substantial side effects (S3 I/O).

If the language implementation is an issue when doing this, you're doing something wrong.

I'm working on a contract doing this exact thing at the moment, and unsurprisingly the Ruby code that's in the picture is responsible for just a tiny little fraction of the overall latency.

You should spend most of the time waiting on S3, pretty much no matter what language you choose. If you don't, it's not the fault of whichever language you use.

In fact, S3 performance is so dominant in this type of scenario if you do things properly, that if you want to optimize for speed and can afford to risk data loss (or in our case, the clients poll for confirmation and re-upload data in the rare event of loss), it's generally proving substantially better to write to local disk and do uploading to S3 in the background.

It's just pass through. The limit in this instance is probably packets per second of the legacy AWS network. Why is this difficult?
Read the analysis. First attempt was "just a pass through".
No, it is just pass through, they just did a crappy job of handling concurrency.
Maybe they should hire somebody smart like you.
Can you please explain why to use Go and not C/C++?

Forget about language syntax / compilation complexity.

Say I know both very well, why go with Go?

thanks

There are a few benefits of Go I can think of

* Fewer lines of code, fewer gotchas and hence easier to reason about and maintain.

* Powerful concurrency primitives (channels, select) built right into the language, rather than a library. A scalable producer-consumer implementation would probably be 100 lines of Go code.

* If your application isn't too latency sensitive (i.e. it isn't a game server, high-frequency trading system, etc.) then the GC simplifies matters. Its stated design goal is to pause for at most 10ms out of every 50ms, which is good enough for most applications (and it typically runs for around 1ms).

* Some of the tooling around the language is great. There are some great articles (I remember one posted to HN yesterday) about how people wrangled a lot more performance out of their code using the profile tool, for instance.

* Miscellaneous goodies like testing out of the box, an extensive standard library and being able to compile in 1/10th of the time.

An example of a service migrated from C++ to Go - dl.google.com - http://talks.golang.org/2013/oscon-dl.slide#1

These are the benefits I could think of if a programmer knows C++ and Go equally well. However, suppose he has to work with fellow programmers who aren't comfortable with either, Go would be a superior choice. It would take a week to learn most of Go and perhaps a month to grok it. I think C++ takes much, much longer than that to learn properly.

Thank you for the reasoned reply.
You're welcome :)
Would have been much simpler and more straight forward to use a library like Akka instead of manually coding all of this.
What's the trick to resubmit without getting stuck going to the first post? For example, I submitted this story 2 hours before this one:

https://news.ycombinator.com/item?id=9844826

In the past, when I try to submit a story, even if it's a couple days old, the submission is ignored, and my vote is added to the original.

The slash at the end of the URL made the difference in this case.
Ouch, it really shows that getting to the HN frontpage can be just a lottery.
You can always cheat by appending garbage with ? or #

Please only do this if you really believe that you should resubmit.

One trick is to add an anchor tag to the end of the URL.
Mods, can we change "go" in the title to "golang"? That's the official name, and it makes it searchable by search engines?
I'm sorry but the official name is Go. You can search for "Golang" or "Go Language" with no issue.
Would this thread come up under one of your suggested searches?
both the original article and the Reddit discussion come up in my results (as 1 and 2). the reddit discussion is only 16 hours old.

https://www.google.com/search?q=golang%201%20million&rct=j

this HN discussion is too recent to show up in my results, however yesterday's thread about Qihoo and golang does show up as number 4 or 5 when searching for "golang qihoo".

no need to panic.

I still see it written as "Go" most of the time. It's unfortunate because I've picked up the habit of searching with "golang".
Go is the official name, golang is the only way to get reliable search results however.