by janwas 1481 days ago
Thanks! Feel free to raise Github issues if you'd like to ask/discuss anything.

I'm also a huge fan of Godbolt/Compiler Explorer. Highway is integrated there, so you can just copy in your functions. Here's the last throwaway test that I was using: https://gcc.godbolt.org/z/5azbK95j9

> things might get better in the future but for now we have to implement it another way

There are several possible answers. 1) For the issue you pointed to, that we cannot have arrays of N vectors, a reasonable workaround is to instead allocate an array of N vectors' worth of elements and trust the compiler to elide the unnecessary Load/Store. This often works with clang and would avoid having to unroll manually here. I do prefer to minimize the amount of compiler magic required for good performance, though, so we typically unroll manually, as shown in that code.

2) If there are compiler bugs, typically the workarounds have to stay in because in the open-source world people are still using that compiler years later.
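To make the workaround in 1) concrete, here is a minimal sketch in plain C++ rather than Highway code. `Vec`, `Load`, `Store`, `MulAdd`, `kLanes` and `kUnroll` are hypothetical stand-ins defined locally; they only mimic the shape of vector ops and are not the library's API. In this toy an array of `Vec` would of course be legal; the point is the pattern for when V is a sizeless type. The accumulators live in one flat array of elements, and each step loads a vector from it, updates it, and stores it back, relying on the compiler to keep the values in registers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins (assumptions for illustration, not Highway's API).
constexpr std::size_t kLanes = 4;   // pretend vector length in elements
constexpr std::size_t kUnroll = 4;  // number of independent accumulators

struct Vec {
  float lane[kLanes];
};

inline Vec Load(const float* p) {
  Vec v;
  for (std::size_t i = 0; i < kLanes; ++i) v.lane[i] = p[i];
  return v;
}

inline void Store(const Vec& v, float* p) {
  for (std::size_t i = 0; i < kLanes; ++i) p[i] = v.lane[i];
}

// Returns mul * x + add, lane by lane.
inline Vec MulAdd(const Vec& mul, const Vec& x, const Vec& add) {
  Vec r;
  for (std::size_t i = 0; i < kLanes; ++i) {
    r.lane[i] = mul.lane[i] * x.lane[i] + add.lane[i];
  }
  return r;
}

// Dot product with kUnroll accumulators stored as kUnroll * kLanes elements
// instead of as an array of kUnroll vectors.
float DotWithFlatAccumulators(const float* a, const float* b,
                              std::size_t num) {
  // Simplification: no remainder handling.
  assert(num % (kUnroll * kLanes) == 0);
  float acc[kUnroll * kLanes] = {};  // N vectors' worth of elements
  for (std::size_t i = 0; i < num; i += kUnroll * kLanes) {
    for (std::size_t u = 0; u < kUnroll; ++u) {
      const std::size_t ofs = u * kLanes;
      // The Load/Store on acc is what we hope the compiler elides.
      const Vec sum = Load(acc + ofs);
      Store(MulAdd(Load(a + i + ofs), Load(b + i + ofs), sum), acc + ofs);
    }
  }
  // Final reduction over all accumulator lanes.
  float total = 0.0f;
  for (std::size_t j = 0; j < kUnroll * kLanes; ++j) total += acc[j];
  return total;
}
```

Whether the loads and stores on `acc` actually disappear depends on the compiler and flags, which is exactly the "compiler magic" caveat above; checking the generated code in Compiler Explorer is the easy way to find out.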

Automatically detecting when things get better is an interesting idea but I am not yet aware of such infrastructure.


  // Compiler doesn't make independent sum* accumulators, so unroll manually.
  // We cannot use an array because V might be a sizeless type. For reasonable
  // code size, we unroll 4x, but 8x might help (2 FMA ports * 4 cycle latency).
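That code needs 2 loads per FMA, so a CPU with 2 FMA ports would need at least 4 load ports to keep both busy. Most CPUs with 2 FMA ports have just 2 load ports, which limits this loop to about 1 FMA per cycle; with the 4-cycle FMA latency from the comment, 4 independent accumulators are enough to cover that, so unrolling by 4 should be more or less ideal.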

But, ideally, the compiler could make the decision based on the target architecture.

Without enabling associative math, it isn't legal to duplicate floating-point accumulators and change the order of the accumulation. Perhaps compiling with `-funsafe-math-optimizations` would help. If you're using GCC, you'll probably need `-fvariable-expansion-in-unroller`, too.
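As a small illustration of why the compiler may not split the accumulator on its own, here is a scalar sketch (function names are made up for this example) comparing a single in-order accumulator with four independent ones. The two are different roundings of the same mathematical sum, so the results can differ.

```cpp
#include <cassert>
#include <cstddef>

// One accumulator: additions happen strictly in source order.
float SumInOrder(const float* x, std::size_t num) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < num; ++i) sum += x[i];
  return sum;
}

// Four independent accumulators. This changes the order of the additions,
// which a compiler may only do by itself when associative math is enabled.
float SumFourAccumulators(const float* x, std::size_t num) {
  assert(num % 4 == 0);  // simplification: no remainder handling
  float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
  for (std::size_t i = 0; i < num; i += 4) {
    sum0 += x[i + 0];
    sum1 += x[i + 1];
    sum2 += x[i + 2];
    sum3 += x[i + 3];
  }
  return (sum0 + sum1) + (sum2 + sum3);
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}`, assuming IEEE-754 single precision with no excess precision and no fast-math (the usual x86-64 setup), the in-order sum is 1 and the four-accumulator sum is 0, while the exact answer is 2. Neither is "right"; the point is only that the reordering is observable, so it has to be requested explicitly or written by hand.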

I think Highway looks great. I'm sure I'll procrastinate on something important to play with it reasonably soon.

Thanks :) I'd be interested to hear how it goes for you.

Agree that 4x unrolling is getting most of the low-hanging fruit without excessive code size. I saw only very slightly better performance on SKX with 8x.

You're right that it's nicer when the compiler can decide about the unrolling, for example with knowledge of whether we have 16 or 32 regs. The unsafe/fast-math flags are pretty dangerous, though :/ (see https://simonbyrne.github.io/notes/fastmath/), especially when they enable flush-to-zero, which would be unacceptable for a library loaded into some other application.