by tehmillhouse 1389 days ago
As a compiler guy, I'd appreciate some look at the layers of abstraction in-between (so, ASM). Microbenchmarks are famously jittery on modern CPUs due to cache effects, branch prediction, process eviction, data dependencies, pipeline stalls, OoO execution, instruction level parallelism, etc.

You have to be really careful to ensure the numbers you're getting are really coming from the thing you're trying to benchmark, and aren't corrupted beyond recognition by your benchmarking harness.

Some of these questions would surely be trivial if I actually knew any Go, but I'm left wondering:

* What does the machine code / assembly look like for this? What does the cast compile down to?

* What's `int` an alias for? I assume 64-bit-signed-integer?

* Are integer casts checked in Go? Would an overflowing cast fault?


* https://godbolt.org/z/3dEdha44W

* Architecture based, so 32 or 64 bit signed integer.

* No faults. Signed become -1 and unsigned become MAX.

This is using GCC Go which no-one actually uses.

Edit: why I'm being downvoted: https://go.godbolt.org/z/699d7KjWr

I for one develop on GCC-Go. The only reason I chose Go as my next project's language is because it has a GCC implementation (I have a strict rule about GPL-licensed development toolchains for my own projects).
Anyone who cares about the extra performance that the decades-old, battle-tested GCC backend can get out of Go code uses it.
I did not test, but I assume the Go compiler is "much" faster than the "semi-old" GCC implementation.
As far as I can see, GCCGo implements Go 1.18 as of today [0]. Moreover, as noted on official Golang webpage for GCCGo [1], there are even some advantages for using GCCGo, quoting:

"On x86 GNU/Linux systems the gccgo compiler is able to use a small discontiguous stack for goroutines. This permits programs to run many more goroutines, since each goroutine can use a relatively small stack. Doing this requires using the gold linker version 2.22 or later. You can either install GNU binutils 2.22 or later, or you can build gold yourself."

Considering GCC can optimize well-written code to the point of saturating the target processor, given the correct target flags, I'm not entirely sure that "gc" would be "much" faster than GCCGo. I'm relatively new to Go, but equally old with GCC, esp. with g++, so I'm assuming the optimization prowess carries over to GCCGo.

Last but not least, GCCGo is a part of GCC as a primary language since 4.7, which is an eternity in software terms.

[0]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgo/VERSION;h=...

[1]: https://go.dev/doc/install/gccgo

This was in 2019 and I doubt it has changed, but the default Go compiler is overall faster than GCCGo: https://meltware.com/2019/01/16/gccgo-benchmarks-2019.html
I was referring to the quality of generated machine code.
I was also talking about that compiled code.
I think he meant the performance of the compiled binary, not the build.
All are good points. Also, loop unrolling and induction variable analysis. LLVM is particularly aggressive at trying to optimize inductions. It will literally turn a "sum of i from 0 to N" into "n(n+1)/2", among others.

It's really important to look at the actual machine code.

An aside.

> It will literally turn a "sum of i from 0 to N" into "n(n+1)/2", among others[1]

Yeah, seen that on a godbolt youtube vid. Question is, should it do this? Or should it force you to use a library, by reporting what you're trying to do and telling you there's an easier way ("sum of 1 to n is <formula>; instead of a loop, use library function 'sum1toN()'")?

I think getting too clever risks hurting the user by not letting them know there's a better way.

[1] actually it seems to do a slightly different version of this to prevent risk of overflow, but same result.

One issue is that an optimization like this could have resulted from many prior inlinings and foldings that had to work together to create the pattern that was finally matched, so it's not obvious to point the user to what they should change their code to. Maybe they actually intended that the code get folded at the end by the compiler, but all of those optimizations were done by templates that will get folded a different, but equally cool way, under different specializations.

Generally I think some of LLVM's optimizations are trying too hard for the amount of complexity and compilation overhead they create. All that complexity comes at the risk of bugs. With optimizations around UB, it becomes downright mind-boggling what could go wrong. But I'm not an LLVM maintainer so what the heck do I know.

Microbenchmarks are especially problematic in real-world code where you load stuff from random addresses in memory.

If the code after the cast is blocked on a memory load, then you have a lot of free instructions while the CPU is waiting for the memory load to complete. In this case it doesn't matter if the cast is free or takes a handful of instructions.

Sometimes code becomes faster by using more instructions to make the data more compact so more of the data stays in the caches.

Microbenchmarks are only valid for the one function under test, and only on the current machine; they're all right for optimizing on particular hardware, but not so much to go out into the world and go "X is faster than Y"

That said, I did like this website where you could set up JS benchmarks, they would run on your own machine and you could compare how it ran on other people's systems. It wasn't perfect, but it gave a decent indication if X was faster than Y. Of course, it's a snapshot in time, JS engines have gone through tons of optimizations over the years.

My point is that the context of the function can invalidate a microbenchmark completely.

If you only call this function once in a while, then the context is more important than the function.

You can only ignore the context when you do video decoding or matrix inversion or similar "context free" long running code.

int is defined to be the pointer width (or something like this), so probably int64 where the OP is running their code.

integer casts are unchecked.
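A short sketch of what "unchecked" means in practice for conversions between integer types: the value is truncated to the low bits of the target type, with no panic. `narrow` and `toUnsigned` are hypothetical helper names. Going through a function parameter matters here, because converting an out-of-range constant directly is rejected at compile time.

```go
package main

import "fmt"

// narrow converts an int64 to int32. Go keeps the low 32 bits and
// does not panic or report overflow.
func narrow(v int64) int32 {
	return int32(v)
}

// toUnsigned reinterprets a signed 32-bit value as unsigned of the
// same width.
func toUnsigned(v int32) uint32 {
	return uint32(v)
}

func main() {
	fmt.Println(narrow(1 << 32))   // prints "0"
	fmt.Println(narrow(1<<32 - 1)) // prints "-1"
	fmt.Println(narrow(1 << 31))   // prints "-2147483648"
	fmt.Println(toUnsigned(-1))    // prints "4294967295"
}
```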

And is it aligned properly?