by pixelesque 28 days ago
Yep, same here and agree.

Compilers have definitely got better, though. One issue in the past (maybe still an issue to a degree, although compilers have improved a lot here over the past 15 years; it used to be one of the things only Intel's ICC actually got right) was that if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union to provide some abstraction, the compiler would often lose track of it when the struct/union was passed through functions (either by value or by const ref). It would often end up 'spilling' the variable from registers (not entirely the correct terminology in this context, but...), producing asm that uselessly reloaded the variable from a stack address further up the call stack when it didn't actually need to. So that was the situation even when using intrinsics within custom wrappers.

From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.
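For anyone who wants to check their own compiler, here is a minimal sketch of the two shapes being compared. It is only an illustration: 'v4f' uses the GCC/Clang vector_size extension as a stand-in for '__m128' / 'float32x4_t' so that it builds on either architecture, and 'Vec4', 'add', 'scale', 'madd_wrapped' and 'madd_raw' are hypothetical names, not from any real library.

```cpp
// Stand-in for '__m128' / 'float32x4_t' (assumption: GCC or Clang).
typedef float v4f __attribute__((vector_size(16)));

// Hypothetical thin wrapper of the kind described above.
struct Vec4 {
    v4f v;
};

// Helpers taking the wrapper by value and by const ref.
static inline Vec4 add(Vec4 a, Vec4 b) {
    return Vec4{a.v + b.v};
}

static inline Vec4 scale(const Vec4& a, float s) {
    v4f k = {s, s, s, s};  // broadcast the scalar to all four lanes
    return Vec4{a.v * k};
}

// Wrapper passed down a small call chain.
Vec4 madd_wrapped(Vec4 a, Vec4 b, float s) {
    return scale(add(a, b), s);
}

// Same arithmetic with the raw vector type passed directly.
v4f madd_raw(v4f a, v4f b, float s) {
    v4f k = {s, s, s, s};
    return (a + b) * k;
}
```

Building this with optimisation on and diffing the asm for 'madd_wrapped' against 'madd_raw' should show whether the wrapper still costs extra loads and stores on a given compiler and target.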

2 comments

Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.

Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.
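A rough sketch of the two-vector struct case, for comparing the generated asm across platforms. Assumptions: GCC or Clang, with the vector_size extension standing in for the platform vector type; 'Pair', 'sum_by_value' and 'sum_split' are made-up names. The 'noinline' attribute is only there to keep the calls visible in the output; it does not change the calling convention.

```cpp
// Stand-in for a 128-bit vector type (assumption: GCC or Clang).
typedef float v4f __attribute__((vector_size(16)));

// Hypothetical aggregate holding two 128-bit vectors.
struct Pair {
    v4f lo;
    v4f hi;
};

static_assert(sizeof(Pair) == 32, "two 128-bit vectors, no padding");

// Struct by value: the platform ABI decides whether 'p' arrives
// in vector registers or in memory.
__attribute__((noinline)) v4f sum_by_value(Pair p) {
    return p.lo + p.hi;
}

// Members passed separately: each one is a plain vector argument.
__attribute__((noinline)) v4f sum_split(v4f lo, v4f hi) {
    return lo + hi;
}
```

Both functions compute the same result; any difference in the asm at the call sites comes from how the struct is passed.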

That's a constraint of the 32-bit x86 ABI.

People invented x32 to fix this. Or just use amd64.

This was with amd64.

ICC was at the time the only compiler that would not do that.