by CoolGuySteve 2729 days ago
Last time I used SSE intrinsics, which was with GCC 4.9 I think, I had a lot of trouble with register usage. It looked like it was compiling down to use only one SSE register for everything instead of parallelizing across them.

I tried the same algorithm in godbolt with some clang versions and it was slightly better, using two or three registers, but not by much. So I had to break it into inline assembly.

I wonder if GCC has improved since then.

2 comments

> It looked like it was compiling down to use only one SSE register for everything instead of parallelizing across them.

Yeah, that's a common problem and it leads to nasty dependency stalls. MSVC is horrible in the same way, at least the 2015 version. I haven't tried newer versions yet. Intel's ICC seems to generate good code most of the time.
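To make the dependency-stall point concrete, here is a minimal sketch (not the original poster's code; the name `sum_4acc` is made up for illustration). It sums floats with four independent SSE accumulators, so consecutive adds don't all wait on the same register. It assumes an x86 target and, to stay short, a length that is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical example: sum n floats with four independent accumulators.
   With a single accumulator every add depends on the previous one;
   four accumulators give the CPU four independent chains to overlap. */
static float sum_4acc(const float *p, size_t n)
{
    assert(n % 16 == 0);  /* simplification for the sketch: no tail loop */

    __m128 a0 = _mm_setzero_ps();
    __m128 a1 = _mm_setzero_ps();
    __m128 a2 = _mm_setzero_ps();
    __m128 a3 = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        a0 = _mm_add_ps(a0, _mm_loadu_ps(p + i));
        a1 = _mm_add_ps(a1, _mm_loadu_ps(p + i + 4));
        a2 = _mm_add_ps(a2, _mm_loadu_ps(p + i + 8));
        a3 = _mm_add_ps(a3, _mm_loadu_ps(p + i + 12));
    }

    /* Combine the four accumulators, then reduce the four lanes. */
    __m128 s = _mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3));
    float lanes[4];
    _mm_storeu_ps(lanes, s);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Whether the compiler actually keeps `a0` to `a3` in separate registers is exactly the thing you'd check in the disassembly.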

> I wonder if GCC has improved since then.

Yes, it has. I've written a lot of SIMD code and spent a good amount of time reading the compiler assembly output and there has been huge improvement over the last decade.

GCC register allocation wasn't great, then it got better with x86 SSE but still sucked at ARM NEON, and now it seems to be decent with both.

Clang was better at SIMD code before GCC was. It was equally good with SSE and NEON.

In my experience, compilers are much better than humans at instruction scheduling. Especially when using portable vector extensions, you don't have to write the same code twice and then tweak the scheduling for every architecture separately.
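For anyone who hasn't seen the portable vector extensions, here is a rough sketch using the GCC/Clang `vector_size` attribute. The type name `v4f` and the function `saxpy_v4` are made up for illustration, and the length is assumed to be a multiple of 4 to keep it short. The same source should compile for SSE or NEON, with instruction selection and scheduling left to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four floats in one 16-byte vector. */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical example: y[i] = a * x[i] + y[i] for n floats. */
static void saxpy_v4(float a, const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);  /* simplification for the sketch: no tail loop */

    const v4f va = {a, a, a, a};
    for (size_t i = 0; i < n; i += 4) {
        v4f vx, vy;
        /* memcpy is the portable way to do unaligned vector loads/stores;
           compilers normally turn it into a single load or store. */
        memcpy(&vx, x + i, sizeof vx);
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;  /* element-wise arithmetic on whole vectors */
        memcpy(y + i, &vy, sizeof vy);
    }
}
```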

> In my experience, compilers are much better than humans at instruction scheduling.

It'd be more accurate to say they're much better than humans when the heuristics, or whatever they use, work. Sometimes the compiler messes up badly.

The workflow is often to compile and then examine disassembly to see whether the compiler managed to generate something sensible or not.

Another issue is that the compiler's pattern matching sometimes fails to generate the right SIMD instruction, even when the data is aligned to the SIMD width. For example, I recently saw ICC fail to generate a horizontal add in the most basic scenario imaginable. *shrug*
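As an illustration of the kind of pattern meant here (not the actual code ICC was given, which isn't shown), here is a sketch of a pairwise add written in plain C next to the explicit SSE3 intrinsic. The function names are made up. The `target("sse3")` attribute is a GCC/Clang feature used here so the snippet builds without extra flags; it assumes an x86 CPU with SSE3.

```c
#include <assert.h>
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Scalar form of the pattern: out = {a0+a1, a2+a3, b0+b1, b2+b3}.
   A compiler may or may not recognize this as a horizontal add. */
static void hadd_scalar(const float a[4], const float b[4], float out[4])
{
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}

/* Explicit version: ask for the horizontal add directly. */
__attribute__((target("sse3")))
static void hadd_sse3(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_hadd_ps(va, vb));
}

/* Sanity check: both versions must agree lane for lane. */
static void hadd_check(const float a[4], const float b[4])
{
    float s[4], v[4];
    hadd_scalar(a, b, s);
    hadd_sse3(a, b, v);
    for (int i = 0; i < 4; i++)
        assert(s[i] == v[i]);
}
```

Comparing the disassembly of `hadd_scalar` against `hadd_sse3` is one way to see whether the pattern matcher kicked in.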

Things like this make me question the wisdom of ever using higher-level languages. We abstracted our description of what we want to happen away from processor instructions, with the idea that we could write code that compiles on multiple architectures without changes. The reality is that we still often need to special-case things, even without performance considerations. And the further we abstract, the more performance seems to suffer and the more often we end up jumping through abstraction hoops rather than getting things done.

The minimalist in me wonders if maybe just using some kind of macro system on top of assembler plus a bytecode VM with the ability to drop to native instructions wouldn't ultimately be better.