There are two other solutions that spring to mind, which might require quite a bit less code:
1) Take the original code, do the upload exactly in place in the original request (not even spawning a goroutine). However: protect the upload with a semaphore which only allows N-in-flight.
My reasoning is that if the system has low latency when operating nominally, blocking the incoming request isn't too painful. The reason there was a problem in the first place is that there were too many requests in flight and the system hit a metastable state where no requests could complete efficiently.
2) (or instead of (1)): If you're going to have a worker pool, why have that complicated chan-chan-Job business? It seems that `func StartProcessor` was close to being a viable solution. All you need is to start a few of those in parallel, each reading from the same `Queue`. Was there a reason to introduce the `WorkerPool chan chan Job`? That looks quite a bit more complicated than it needs to be. The queues don't need to be separate per worker unless there is some other substantial reason.
--
The next thing one would need to take care of is to ensure that the whole system doesn't stall due to a broken/laggy network, so, to put some timeouts on the S3 uploads, for example, to ensure the system can return to a stable state on its own when the thundering herd has passed.
Re: Semaphore:
Not dropping the connection might mean we starve other incoming msgs of resources (which might be a good thing) [0]. Also, releasing the semaphore back into the pool in case of failures takes on a happy complication of having to deal with errors beyond one's control.
Re: Queue:
Wouldn't the queue involve locking lest two workers end up trying to work on the same request? To be completely concurrent, I guess one could use a lock-free data structure instead (or implement one on top of something like RocksDB)?
Block on N-Semaphore, with timeout
Do timeout upload
Replace N-Semaphore
If you don't always replace the semaphore, that's a bug.
Queue: I'm just comparing to what the article does. It already has contention on a queue (the chan), it's just the chan-chan-Worker rather than chan-Job. In practice, go channels happily handle millions of messages contending to multiple workers just fine. Consider this test example where you aren't even actually burning any CPU to perform the work:
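    package main

    func main() {
        // One unbuffered channel shared by every worker.
        q := make(chan int)

        // Ten workers contend for messages on the same channel.
        for i := 0; i < 10; i++ {
            go func() {
                for x := range q {
                    x = x * 10 // trivial "work", burns almost no CPU
                }
            }()
        }

        // Push a million messages through the shared channel.
        for i := 0; i < 1000000; i++ {
            q <- i
        }
    }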
On my laptop, it runs in 0.333s on a single core, and it's slightly slower when you set GOMAXPROCS > 1, but not by much: the total runtime goes to 0.4-0.5s or so (measured with Go 1.4). As soon as you do any actual work with the messages you are passing around, the overhead of locking will be lost in the noise.
I was confused by the numbers at first, so: that's 17k requests per second, spread out over 4 dual core Xeon (Haswell) machines, which works out to just over 4000 requests/s per machine. It's still a respectable number, but it's much closer to what one would expect given the task.
Don't get me wrong, the most interesting part is definitely the implementation and as a Go noob I found it very useful - it's just a bit misleading for the headline to sum your request rate across all parallelized machines.
>> But since the beginning, our team knew that we should do this in Go because during the discussion phases we saw this could be potentially a very large traffic system
They're unfamiliar with network programming, and so picked a language that seems fast to them, is my assessment.
Based on their description, they should be IO bound. If they're IO bound, their system should spend the vast majority of time in the kernel executing system calls.
If they do that, then whichever language they use has very little relevance to whether or not their system ends up being fast.
EDIT: As an example, my first ever production Ruby app was a messaging middleware server that processed about 3 million messages a day (34/sec) on 10% of a single mid-range 2005-era Xeon core. It was totally unoptimized, and used select() instead of the more modern, faster alternatives. Also, only about 10% of that time was spent in user space. 90% of it was in the kernel, executing system calls, so most meaningful optimization would be in improving the usage of system calls that are exactly the same across languages.
At the time, going beyond a single core required multiple instances, as Ruby lacked native threads back then. But processing 500-600/sec across 2-3 processes on a dual core 2005-era Xeon would not have been challenging even then with that quite naive and dated approach.
As someone else pointed out, they're doing ~4k/sec on a modern dual-core Xeon. Based on the above I'd expect it to be fairly easy to match their ~4K/sec with Ruby today.
It's less about Go being incredibly fast than about Ruby being anomalously slow. It makes sense if you start from the presumption that Go is the only "fast" language they're considering.
Frankly, if you're going to be processing requests and shuffling them to S3, Ruby is nowhere near being the limiting factor unless you do something pathologically bad.
I recently spent time optimizing a setup that does pretty much exactly what they're doing (in Rails; I don't like Rails, but in this case Rails is not a problem), and the time saved came from things like eliminating unnecessary buffering all over the place (e.g. POSTs get buffered by pretty much everything that likes to consider itself a web server, sometimes multiple times; if you e.g. run Nginx in front of pretty much any Ruby web server running Rails, worst case you may end up with things passed through at least 3, possibly 4 buffers).
Yes. Other than some very basic scripts, I have never worked with Ruby, so I was not aware that it is so slow. Thanks for pointing that out.
As for my recommendation, it is pretty standard worker architecture:
Their system seems to be an ingesting-only system, that is, the clients are getting an empty HTTP 200 OK response. Given this, I would put openresty (nginx) in the front, with some trivial Lua code[1] to en-queue payloads to beanstalkd. Then, you can either have your workers inside openresty (using Openresty timers) or have them as separate processes and written in the language of choice. We have been using this for a couple of years now and it is working really well for our use case, also an ingesting-only system.
That may be true for some types of code, but for an app that's predominantly shuffling data over the network you should be spending most of the time in kernel space executing syscalls, and then language differences are largely irrelevant.
> and much easier to write concurrent code in.
How? Writing concurrent code in Ruby is trivial since 1.9.x (prior to 1.9 you had to battle the green threads in MRI for some stuff), which isn't exactly new.
> Go is 1-2 orders of magnitude faster than Ruby, and much easier to write concurrent code in.
Than MRI Ruby you mean.
The advantage wouldn't be as much if Ruby designers cared to add AOT compilation in the same vein as Dylan or Common Lisp to the canonical implementation.
That's exactly the space that Crystal [0] seems to be exploring. Statically type checked and pre-compiled via LLVM. So it isn't the Ruby designers themselves, but definitely Ruby flavored.
> > Go is 1-2 orders of magnitude faster than Ruby, and much easier to write concurrent code in.
> Than MRI Ruby you mean.
Given the nature of orders-of-magnitude comparisons and the lack of Ruby implementations that are even one order of magnitude faster than MRI, "...than Ruby" is reasonably accurate if "...than MRI Ruby" is at all accurate.
> The advantage wouldn't be as much if Ruby designers cared to add AOT compilation in the same vein as Dylan or Common Lisp to the canonical implementation.
Maybe, though that's unproven. AFAIK, actual Ruby implementations with AOT only seem to gain about a factor of 2 improvement, not an order of magnitude.
I was always a big fan of Erlang's concurrency model but I've really been getting into Elixir lately. It is just such a productive language to write and lends itself to extremely maintainable codebases.
There's nothing concurrent in this case, parallel sure, but concurrency wise this is all shared nothing. And languages are not faster than one another. They are faster at certain tasks but the variability for a given task is really case specific. For io heavy workloads like this underlying libraries and implementations dominate execution time.
You could have written this in any number of languages. I don't know why Go is more logical than, say, Java or JavaScript.
I'm a big clojure fan and not the biggest Lua fan, but I'm afraid you are mistaken. You won't find any dynamically typed language implementation that can come close to LuaJIT for the general use case.
What if you type hint everything in Clojure properly so all reflection is avoided? (does take some time, but with something like YourKit it'll be only a few hours max on a mid-size project)
If you're talking about raw performance, I'm afraid you're mistaken. If you're perhaps referring to the 'aesthetics' of the language syntax, that's pretty subjective, in which case you're right for your tastes.
Does anybody here have any experience with Go's garbage collection pauses with large heap sizes? I've got a scala app that regularly consumes about 48G of ram, and I'm very happy with the response times during heavy loads like this, but the P99.5 is abysmal because of garbage collection. I've tried tuning it, but it doesn't seem like anything I do helps. I'll probably end up using an Azul JVM but I'm curious how other languages end up handling this problem.
Heap sizes 32gb+ can't use compressed pointers so don't forget about that bit of overhead. 48gb is just in that annoying zone where you need to go above 48gb heap to get real benefits of 32gb+ effective heap.
But I only maintain EE apps on at most 10gb heaps, so I'm not hugely experienced with tuning this. All I can recommend is heap size and CMS thresholds (no G1 exp) set so that at steady state you don't end up hitting full GCs.
In the Big Data space we have dozens of machines all with very large heap sizes (I run mine with 250GB) and don't run into any major stop the world pauses.
> I've got a scala app that regularly consumes about 48G of ram, and I'm very happy with the response times during heavy loads like this, but the P99.5 is abysmal because of garbage collection. I've tried tuning it, but it doesn't seem like anything I do helps.
When tuning a prod Scala deployment a while back, I encountered a nasty "pregnant pause" every once in a while similar to what you have experienced. So, below are what I used in annotated form and adapted for the deployment you describe (the max heap setting). Some of these may be completely obvious to you (or others reading), yet are included for completeness.
-server
This one should be obvious :-)
-Xmx54G
Memory is cheap, so give the JVM 54 gig so that GC
isn't forced to run when your system is in the
steady-state of 48G heap utilization.
-XX:PermSize=128m -XX:MaxPermSize=1024M
Ditto on the cheapness of memory.
-Xss1M
A stack size of 1 meg seems a bit much, but does
allow for recursive algorithms to operate with
impunity.
-XX:ReservedCodeCacheSize=128m
This one was needed for Scala. It likely should
be specified with a high value since it limits the
JIT's code cache.
-XX:+DoEscapeAnalysis
A nice way to relieve some heap pressure[1].
-XX:+UseCodeCacheFlushing
Should the ReservedCodeCacheSize be exceeded, this
lets the JIT continue to do its thing in an LRU
type of fashion (I believe).
-XX:+UseParallelGC
This one is the most impactful one of all. A lot
of people will say "use UseConcMarkSweepGC!" They
are wrong for high volume server deployments. The
concurrent mark and sweep algorithm caused massive
"pregnant pauses" in prod for me! The Parallel GC
algorithm performs much better under load and
doesn't cause the VM to sit-and-spin for 20+
seconds.
-XX:+UseCondCardMark
Another tweak which had a major performance boost
for me[2].
-XX:+UseNUMA
If your servers are NUMA[3] based, then this can
significantly increase performance[4] as well.
Have you tried the new G1 GC? I'm not very familiar with GC tuning, but I thought using the CMS GC with very large heaps is asking for trouble. Apparently the G1 GC alleviates this, and still manages to be as "cool" as the CMS GC.
IIRC, I did and it didn't benchmark well for my needs. However, it all depends on what JRE you're using and what the system is doing to pick the GC which is best for any given deployment. Classic case of YMMV and all that.
In the end, even though it may sound trite, the only way to know what works best for a given combination of JRE/OS/hardware is to measure it. This article[1] had some good tips and Mission Control[2] is a huge help in this arena.
99.5th percentile -- below that threshold, requests are generally fine, but there's a small fraction of requests that take way too long to go through due to GC kicking in.
I'm not the author of the article, but if I'd have to guess: Data encapsulation, and interface control. If all your clients are talking directly to S3 instead of your encapsulated service interface, you can't inject any business logic at all and must design around the fact that you don't control the service interface, Amazon does.
It's interesting this came up today. I'm looking for a new language to migrate my flask/uwsgi web service to. I'm having a terrible time making it scale.
Are there any tutorials/templates/best practices for writing a small web service in Golang?
The best advice I can give is to write a minimal web server using net/http[0], get a feel for how to write a RESTful HTTP API by hand, and then start experimenting with web frameworks like Gin[1], Martini[2], and Beego[3].
I'd recommend against both martini and gin. Neither properly uses the http.Handler interface, and the owner of martini has officially stated that martini is not idiomatic.
I loved it when I started, then really hated it when I needed to refactor, there is just too much magic and you pay a speed cost for that magic.
This is terrible advice. Golang may have a flavor-of-the-week status in some people's minds, but it's a language that really deserves more credit. It is really easy to program and learn, and it does a lot of stuff other languages rely on third party programs for. With each release, nagging issues (namely around garbage collection) are getting resolved, but it's more than production ready. To ignore Golang right now would be akin to ignoring Java back in the early 2000s in my opinion.
Golang has my attention, but I don't think it's anywhere near Java, at least popularity-wise, in the early 2000s. By then most schools had already switched their language of choice to Java - I'm not aware of any that has switched theirs to Golang.
I do enjoy coding in Golang, but we use mostly Java where I work, and for us, the benefits don't make up for the things we lose. This blog post is a great example: the solution they had to find is the first thing you'd probably do in Java, because Java has a standard package with all sorts of concurrency patterns.
Yeah, I remember wanting to do a fan-out pattern in Go and reading the Go Pipelines and Cancellation article[0]. I saw the function merge() in the article and thought, "Great, here's what I need!" I then proceeded to read further and saw that I have to define this function myself based on the types I'm using, which made me quite sad.
Go really needs a library for these patterns built in... I assume the lack of generics prevents users from creating that themselves (I'm not trying to start a language war here, seriously).
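For reference, a fan-in `merge()` in the general shape the pipelines article describes, specialized to `int`, might look like this. It is a sketch without the cancellation handling the article goes on to add, and `gen` is just a helper for the demo:

```go
package main

import (
	"fmt"
	"sync"
)

// merge fans several int channels into one. The output channel is
// closed only after every input channel has been closed and drained.
func merge(cs ...<-chan int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup
	wg.Add(len(cs))
	for _, c := range cs {
		// One forwarding goroutine per input channel.
		go func(c <-chan int) {
			defer wg.Done()
			for v := range c {
				out <- v
			}
		}(c)
	}
	// Close out once all forwarders are done.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// gen emits the given values on a fresh channel and then closes it.
func gen(vals ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, v := range vals {
			out <- v
		}
	}()
	return out
}

func main() {
	total := 0
	for v := range merge(gen(1, 2, 3), gen(10, 20)) {
		total += v
	}
	fmt.Println("sum:", total)
}
```

The pain point is visible in the signature: every `int` above would have to be rewritten by hand for each element type you need.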
"To ignore Golang right now would be akin to ignoring Java back in the early 2000s in my opinion."
Except Go provides limited value for somebody who already knows .net or a jvm language. And that's a huge chunk of the market.
And the key advantage that Go offers to business is that they can easily hire from a pool of experienced programmers and have them learn Go with little downtime.
Sitting at a corporation and learning Java may have been a popular thing to do to join the "boys" club in the early 2000s but I promise you that won't last forever. It's not bad advice to pick a target and actually point at it instead of wasting time solving a problem that has already been fixed.
It depends a lot on what your constraints are and what goal you're trying to achieve. Go could be the right choice, but I believe the prior poster was alluding to the fact that it does not fit the sinatra/flask/etc. use case very well in most situations.
Little confused here. So the third solution mentioned in the post is just a worker pool? I think if we just slightly modify the second solution, it will work:
1) Initialize a job channel.
2) Initialize a set of workers that listen on this channel and pull jobs indefinitely. In this case, just call go StartProcessor() a fixed number of times.
What confuses me is that IMO workPoolChannel isn't necessary here. What was the reasoning behind using a channel for workers?
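A sketch of that simpler design: one job channel and a fixed number of workers ranging over it. `Job`, `startWorkers`, and `runDemo` are illustrative names, and the `process` callback stands in for the S3 upload. This is not the article's code:

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a placeholder for the article's payload type (hypothetical shape).
type Job struct {
	ID int
}

// startWorkers launches n goroutines that all receive from the same queue.
// The channel hands each job to exactly one worker, so no extra locking
// is needed to keep two workers off the same job.
func startWorkers(n int, queue <-chan Job, process func(Job)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue { // exits once queue is closed and drained
				process(job)
			}
		}()
	}
	return &wg
}

// runDemo pushes total jobs through n workers and returns how many ran.
func runDemo(n, total int) int {
	queue := make(chan Job, 100)
	var mu sync.Mutex
	processed := 0
	wg := startWorkers(n, queue, func(Job) {
		mu.Lock()
		processed++
		mu.Unlock()
	})
	for i := 0; i < total; i++ {
		queue <- Job{ID: i}
	}
	close(queue) // signal the workers that no more jobs are coming
	wg.Wait()
	return processed
}

func main() {
	fmt.Println("processed:", runDemo(4, 1000))
}
```

The buffer size of 100 is arbitrary. A full buffer blocks the producer, which gives backpressure without any extra machinery.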
IMHO this solution is nice but wrong.
The point is not just creating a "cool" program with Go that will handle HTTP requests.
Without really knowing the company's needs, I am relying on this paragraph from the post:
While working on a piece of our anonymous telemetry and analytics system, our goal was to be able to handle a large amount of POST requests from millions of endpoints. The web handler would receive a JSON document that may contain a collection of many payloads that needed to be written to Amazon S3, in order for our map-reduce systems to later operate on this data.
Knowing this, I would build it differently.
1. Clients post to S3 Directly
2. Lambda -> Overload business logic, private data, cleanup, spam control etc...
3. Prepare files (64M) for Hadoop
4. Hadoop
There's no reason to have that proxy in the middle, Amazon S3 will handle those millions of requests with no real trouble, I wouldn't throw machines on this process.
Other than the worker/concurrency mechanism being part of the language, what's the difference to a RabbitMQ-Worker architecture? You might even argue that the lack of persistence (in the given example) is a potential source for data loss.
The difference is management of a simple process (behind elastic load balancer, of course) vs a more complicated architecture with three distinct, load balanced process types (webserver -> queue -> worker).
Is this supposed to be great performance? I think Netty does ~30K/s (1800000 req/min) out of the box. I thought Go had more out-of-the-box performance; maybe I am missing something.
Hey, and Elixir is already way more expressive than Go and it's incredibly easy to build fault-tolerant and distributed systems, not to mention the productivity gains when using the phoenix framework!
Seriously, posting requests/second metric without any context about hardware and sample code doesn't help anyone.
Can already do this with a single thread using epoll. In fact, even a simple epoll implementation can handle 1M HTTP requests in ~30 seconds on a single thread juggling 10k connections.
If the language implementation is an issue when doing this, you're doing something wrong.
I'm working on a contract doing this exact thing at the moment, and unsurprisingly the Ruby code that's in the picture is responsible for just a tiny little fraction of the overall latency.
You should spend most of the time waiting on S3, pretty much no matter what language you choose. If you don't, it's not the fault of whichever language you use.
In fact, S3 performance is so dominant in this type of scenario if you do things properly, that if you want to optimize for speed and can afford to risk data loss (or in our case, the clients poll for confirmation and re-upload data in the rare event of loss), it's generally proving substantially better to write to local disk and do uploading to S3 in the background.
* Fewer lines of code, fewer gotchas and hence easier to reason about and maintain.
* Powerful concurrency primitives (channels, select) built right into the language, rather than a library. A scalable producer-consumer implementation would probably be 100 lines of Go code.
* If your application isn't too latency sensitive (unlike, say, a game server or high-frequency trading) then the GC simplifies matters. It's guaranteed to run for a maximum of 10ms out of every 50ms, which is good enough for most applications (but it typically runs for around 1ms).
* Some of the tooling around the language is great. There are some great articles (I remember one posted to HN yesterday) about how people wrangled a lot more performance out of their code using the profile tool, for instance.
* Miscellaneous goodies like testing out of the box, an extensive standard library and being able to compile in 1/10th of the time.
These are the benefits I could think of if a programmer knows C++ and Go equally well. However, suppose he has to work with fellow programmers who aren't comfortable with either, Go would be a superior choice. It would take a week to learn most of Go and perhaps a month to grok it. I think C++ takes much, much longer than that to learn properly.
this HN discussion is too recent to show up in my results, however yesterday's thread about Qihoo and golang does show up as number 4 or 5 when searching for "golang qihoo".