Hacker News
by RyanZAG 2933 days ago
Kind of defeats the purpose of using golang for a task like this. The whole point of golang is using the little greenlet threads, but actually using them in this case is terrible on performance.

The remaining performance left on the table is all in memory allocation and garbage collection, something you could optimize relatively easily if it were written in C, for example by using a memory pool so that you wouldn't need allocations or garbage collection at all.

Of course if performance isn't a big issue for your task, then none of this is really important.

5 comments

Using Go doesn’t mean you have to use many goroutines or must not do some manual memory management where it is the right thing to do.

This article nicely shows how optimizing your program yields more speed than randomly throwing goroutines at it. In the end it does use goroutines to good effect, but only after proper consideration.

>Kind of defeats the purpose of using golang for a task like this. The whole point of golang is using the little greenlet threads, but actually using them in this case is terrible on performance.

The point of Golang is using them intelligently, not merely throwing them at any problem like all you've got is a hammer...

It's surprising that the per-file goroutines were so expensive, though. (The original per-line goroutine, sure, that's excessive if you care about performance.) Just using long-lived workers seems non-idiomatic for Go, but it certainly pays big dividends in this example.

Per-file may have had other problems not related to the Go runtime, such as IO contention. I'm not going to check it, but it would be easy to verify that just by using a limited number of them at a time. Spawning a new goroutine in that case is not strictly necessary, but would still be good software engineering.
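For anyone who wants to try the "limited number of them at a time" check, here is a minimal sketch using a buffered channel as a counting semaphore. The names `processAll` and `countLines` are made up for illustration, and `countLines` is only a stand-in for the real per-file parsing; none of this is taken from the article's code.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// countLines is a hypothetical stand-in for the per-file parsing work.
func countLines(data []byte) int {
	return bytes.Count(data, []byte{'\n'})
}

// processAll still starts one goroutine per file, but allows at most
// `limit` of them to be in flight at once. limit must be at least 1,
// otherwise the first send on sem would block forever.
func processAll(files [][]byte, limit int) []int {
	results := make([]int, len(files))
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	for i, f := range files {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` goroutines are running
		go func(i int, f []byte) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			results[i] = countLines(f) // each goroutine writes its own index
		}(i, f)
	}
	wg.Wait()
	return results
}

func main() {
	files := [][]byte{[]byte("a\nb\n"), []byte("c\n"), []byte("")}
	fmt.Println(processAll(files, 2))
}
```

Timing this with different values of `limit` would show whether the slowdown comes from goroutine creation itself or from too many tasks competing for the same resource.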

One of the problems I see repeatedly when people try to benchmark things with concurrency is when they don't specify a problem that is CPU-intensive enough, so it ends up blocked on other elements of the machine. For a task like this, I'd expect optimized Go to easily keep up with a conventional hard drive, and with just a bit of work, come within perhaps a factor of 2 or 3 of keeping up with the memory bandwidth on a consumer machine (including the fact that since you're going to read a bit, then write some stuff, you're not going to get the full sequential read performance out of your RAM), not because Go is teh awesomez but because the problem isn't that hard. To get big concurrency wins, you need a problem where the CPU is chewing away at something but isn't constantly hitting RAM or disk or network for it, such that those systems become the bottleneck.

Hi jerf, please note that

- the benchmark was designed to repeatedly parse an in-memory byte slice (not the hard drive), thus IO contention is unlikely here;

- concurrency is a big win when IO is a bottleneck : keep processing dozens of things while some of them are waiting for data from network or hdd.

"the benchmark was designed to repeatedly parse an in-memory byte slice (not the hard drive), thus IO contention is unlikely here"

You could still be getting IO contention from the RAM system. RAM is not uniformly fast; certain access patterns are much faster than others.

"concurrency is a big win when IO is a bottleneck : keep processing dozens of things while some of them are waiting for data from network or hdd."

Concurrency is a win when IO is a bottleneck on a single task. Once you've got enough tasks running that all your IO is used up, adding more may not only fail to speed things up, but may slow things down. I'm speaking of situations where you've used up your IO. The tasks you're benchmarking are so easy per-byte that I think there's a good chance you used up your IO, which at this level of optimization, must include a concept of memory as IO.

I think you'd be helped by stepping down another layer from the Go runtime and thinking more about how the hardware itself works regardless of what code you are running on it. Go can't make the hardware do anything it couldn't physically do, and I'm not getting a sense you deeply understand those limits.

Yes, it does feel that something is wrong somewhere, but I can't find out where. Nobody would be using the idiomatic goroutine-per-task with that kind of overhead, yet it's one of the most common building blocks of golang projects.

Hi iainmerrick, just for info the measured per-file cost didn't include reading from the filesystem. Only the in-memory parsing was taken into account.

I write single-threaded Go all the time, and you can use most of the same optimizations in Go that you could use in C (including pools). It's pretty easy to opt out of GC in Go. And you still keep all of the other benefits of using a modern, higher-level language (security, memory safety, straightforward tooling, etc).
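To make the pool point concrete, here is a minimal sketch using the standard library's `sync.Pool`. The function `formatRecord` and its key=value task are made up for illustration and are not from the article. Note that `sync.Pool` only reduces allocations: pooled items may still be dropped by the runtime at any time, so this is not a complete opt-out of GC.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so the hot path does not
// allocate a fresh bytes.Buffer for every record.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// formatRecord is a hypothetical per-record task that needs scratch space.
func formatRecord(key, value string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may hold data from a previous use
	defer bufPool.Put(buf) // return the buffer for reuse

	buf.WriteString(key)
	buf.WriteByte('=')
	buf.WriteString(value)
	return buf.String() // String copies the bytes, so reusing buf later is safe
}

func main() {
	fmt.Println(formatRecord("lang", "go"))
	fmt.Println(formatRecord("gc", "on"))
}
```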
It only does because the author is testing for their specific environment. At some number of cores, the concurrent calls will produce more performant code than running sequentially.

Ideally, in this case I would think one would want to check the number of cores and decide what route to take.
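A rough sketch of that idea, using `runtime.NumCPU` from the standard library. The helper `chooseWorkers` and the threshold of 4 cores are assumptions picked for illustration; the right cutoff would have to be measured on the actual workload.

```go
package main

import (
	"fmt"
	"runtime"
)

// chooseWorkers is a hypothetical helper: below minCores it falls back to
// a single sequential worker, otherwise it uses one worker per core.
func chooseWorkers(cores, minCores int) int {
	if cores < minCores {
		return 1
	}
	return cores
}

func main() {
	// The threshold 4 is an arbitrary example value, not a recommendation.
	n := chooseWorkers(runtime.NumCPU(), 4)
	fmt.Println("workers:", n)
}
```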

I don't think that was the case the first time the author removed parallelism. Running things in parallel has a cost. That cost often comes in the form of memory allocations and data copies, so the unit of work can be stored and shared with another thread, plus the synchronization costs of scheduling threads. If that aggregate cost is greater than the cost of the computation itself, you'll never win.

At the point where the author removed parallelism and the sequential code was faster, I think that is what happened: the computation was too fine-grained. The author then successfully took advantage of parallelism by applying it at a coarser granularity, so each thread did more work. At that point, the author does tune the solution for the execution environment, as he uses a fixed set of goroutines to process a bunch of messages rather than one goroutine per message.
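A minimal sketch of the coarse-grained version, assuming the messages can be split into one contiguous chunk per worker. The names `sumParallel` and `parse` are hypothetical, and `parse` just returns the message length as a stand-in for real work; the article's own code may be structured differently (for example with channels).

```go
package main

import (
	"fmt"
	"sync"
)

// parse is a hypothetical stand-in for the per-message work.
func parse(msg string) int { return len(msg) }

// sumParallel uses a fixed number of goroutines, each handling a whole
// chunk of messages, instead of starting one goroutine per message.
func sumParallel(msgs []string, workers int) int {
	if workers < 1 {
		workers = 1
	}
	partial := make([]int, workers)
	chunk := (len(msgs) + workers - 1) / workers // ceiling division
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(msgs) {
			break // fewer messages than workers
		}
		end := start + chunk
		if end > len(msgs) {
			end = len(msgs)
		}
		wg.Add(1)
		go func(w int, part []string) {
			defer wg.Done()
			sum := 0 // accumulate locally, write the shared slice once
			for _, m := range part {
				sum += parse(m)
			}
			partial[w] = sum
		}(w, msgs[start:end])
	}
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	msgs := []string{"ab", "c", "def", "gh"}
	fmt.Println(sumParallel(msgs, 2))
}
```

The scheduling and synchronization cost is paid once per chunk here rather than once per message, which is the granularity argument above.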

scott_s you're totally right on both points.

FWIW I really mean the "take the numbers with a grain of salt" advice, i.e. "Your mileage may vary". What I'm sharing in this article is not a bunch of hard, strong, exact numbers; it's a journey and an invitation to apply a similar reasoning process to your own use case and hardware.

For the record, I enjoyed your post. It's a great example of what clear-headed performance optimization looks like.