by janwas 572 days ago
I hope people aren't writing directly to AVX2. When using a wrapper such as Highway, you get exactly this kind of update after a recompile, or even just by running your code on a CPU that supports newer instructions.

The cost is that the binary carries around both AVX2 and AVX-512 codepaths, but that is not an issue IMO.
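
To make the "binary carries both codepaths" point concrete, here is a minimal hand-rolled sketch of runtime dispatch. It is not Highway's mechanism; the names `SumScalar`, `SumAVX2`, `ChooseSum` and `Sum` are made up for illustration, and it assumes GCC or Clang (for `__builtin_cpu_supports` and the `target` attribute). On other compilers or non-x86 CPUs it simply falls back to the scalar path.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1
#else
#define SKETCH_X86_DISPATCH 0
#endif

// Baseline codepath: always present in the binary, runs on any CPU.
static int32_t SumScalar(const int32_t* p, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i) sum += p[i];
  return sum;
}

#if SKETCH_X86_DISPATCH
// AVX2 codepath: compiled for AVX2 regardless of the global -m flags, so it
// must only be called after the runtime check in ChooseSum() succeeds.
__attribute__((target("avx2")))
static int32_t SumAVX2(const int32_t* p, size_t n) {
  __m256i acc = _mm256_setzero_si256();
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
    acc = _mm256_add_epi32(acc, v);
  }
  // Reduce the eight lanes, then handle the remaining elements.
  int32_t lanes[8];
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
  int32_t sum = 0;
  for (int32_t x : lanes) sum += x;
  for (; i < n; ++i) sum += p[i];
  return sum;
}
#endif

using SumFunc = int32_t (*)(const int32_t*, size_t);

// Picks the best codepath the current CPU supports.
static SumFunc ChooseSum() {
#if SKETCH_X86_DISPATCH
  if (__builtin_cpu_supports("avx2")) return &SumAVX2;
#endif
  return &SumScalar;
}

// Public entry point: the CPU check runs once, on the first call.
int32_t Sum(const int32_t* p, size_t n) {
  static const SumFunc func = ChooseSum();
  return func(p, n);
}
```

Callers only ever see `Sum`. Adding an AVX-512 variant would mean one more function and one more check in `ChooseSum`, which is the binary-size cost described above.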

4 comments

Many use cases for SIMD aren't trivially expressible through wrappers and abstractions. It is sometimes cleaner, easier, and produces more optimized codegen to write the intrinsics directly. It isn't ideal but it often produces the best result for the effort involved.

An issue with the abstractions that does not go away is that the optimal code architecture -- well above the level of the SIMD wrappers -- is dependent on the capabilities of the silicon. The wrappers can't solve for that. And if you optimize the code architecture for the silicon architecture, it quickly approximates writing architecture-specific intrinsics with an additional layer of indirection, which significantly reduces any notional benefit from the abstractions.

The wrappers can't abstract enough, and higher level abstractions (written with architecture aware intrinsics) are often too use case specific to reuse widely.

Wrappers can be zero-overhead, so any claim of better codegen vs the underlying intrinsics sounds dubious. "best result for the [higher] effort involved" also contradicts my experience, so I ask for evidence.

One counterexample: our portable vqsort [1] outperforms AVX-512-specific intrinsics [2].

I agree that high-level design may differ. You seem aware that Highway, and probably also other wrappers, supports specializing code for some target(s), but possibly misunderstand how, given the "additional layer of indirection" claim. Wrappers give you a portable baseline, and remove some of the potholes and ugly syntax, but boil down to inlined wrapper functions.
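
A rough sketch of what "inlined wrapper functions" means in practice. This is a hypothetical toy wrapper, not Highway's API: the names `Vec4f`, `Load`, `Store`, `Add`, `Mul` and `MulAdd4` are invented for illustration. Each wrapper is a single intrinsic on x86, so after inlining one would expect the same instructions as hand-written intrinsics, while the kernel at the bottom also compiles on targets without SSE.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

namespace wrapper_sketch {

#if defined(__SSE__)
// x86 backend: every wrapper forwards to exactly one SSE intrinsic.
struct Vec4f { __m128 raw; };
inline Vec4f Load(const float* p) { return Vec4f{_mm_loadu_ps(p)}; }
inline void Store(Vec4f v, float* p) { _mm_storeu_ps(p, v.raw); }
inline Vec4f Add(Vec4f a, Vec4f b) { return Vec4f{_mm_add_ps(a.raw, b.raw)}; }
inline Vec4f Mul(Vec4f a, Vec4f b) { return Vec4f{_mm_mul_ps(a.raw, b.raw)}; }
#else
// Fallback backend: same interface, plain loops over four lanes.
struct Vec4f { float raw[4]; };
inline Vec4f Load(const float* p) { return Vec4f{{p[0], p[1], p[2], p[3]}}; }
inline void Store(Vec4f v, float* p) {
  for (int i = 0; i < 4; ++i) p[i] = v.raw[i];
}
inline Vec4f Add(Vec4f a, Vec4f b) {
  for (int i = 0; i < 4; ++i) a.raw[i] += b.raw[i];
  return a;
}
inline Vec4f Mul(Vec4f a, Vec4f b) {
  for (int i = 0; i < 4; ++i) a.raw[i] *= b.raw[i];
  return a;
}
#endif

// Portable kernel: written once against the wrapper, out = a * b + c.
inline void MulAdd4(const float* a, const float* b, const float* c,
                    float* out) {
  Store(Add(Mul(Load(a), Load(b)), Load(c)), out);
}

}  // namespace wrapper_sketch
```

Whether a given compiler really emits identical code for the wrapped and unwrapped versions is something to check in the disassembly rather than assume.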

If you want to specialize, that is supported. And what is the downside? Even if you say the benefit of a wrapper is reduced vs manually written intrinsics (and reinventing all the workarounds for their missing instructions), do you not agree that the benefit is still nonzero?

[1]: https://github.com/google/highway/tree/master/hwy/contrib/so...

[2]: https://github.com/Voultapher/sort-research-rs/blob/38f37eef...

The downside is that you write an implementation in Highway, find that it doesn't perform how you want, and then you have to rewrite it.

Curious - how is/was performance helped by rewriting? Why not reach out to us, to see if it can be fixed in the library - wouldn't that be cheaper than rewriting?

I’ve moved on to other things so I can’t really give details anymore. I understand this is annoying to hear for someone who works on that library, but I want to say that your comment is also annoying, for different reasons. Those reasons mostly answer your question, so I’ll explain anyway.

Highway is (I feel not very controversially) kind of like a compiler but worse at its job. It’s not meant to be as general and it only targets a limited set of code, namely code that is annotated to vectorize well. But looking at it as a compiler is kind of useful: it’s supposed to make writing faster code easier and more automatic. Sometimes compilers are not able to do this, and sometimes neither is Highway. Maybe its design lacks the expressiveness to represent the algorithm people want. Perhaps it doesn’t quite lower to the optimal code. Maybe it turns out that so little of the operation maps to the constructs that a huge amount needs to go through the escape hatch that you offer, at which point it’s not really worth using the library anyway.

In that situation, given an existing and friendly relationship, I would be happy to reach out. But this is a cost to me, because I need to simplify and generalize the thing I want. Then I hand it to you and you decide how you want to tackle it, if at all. All the while I’m waiting and I have code that needs to be written. This is a cost, and something that as an engineer I weigh against just using the intrinsics directly, which I know do exactly what I need but with higher upfront and maintenance costs. When you see someone write their own assembly instead of letting the compiler do it for them, they’re making their version of the same tradeoff.

Thank you for sharing your thoughts!

> it’s supposed to make writing faster code easier and more automatic

Agree with this viewpoint. I suppose that makes it compiler-like in spirit, though much simpler.

I also agree that waiting for input/updates is a cost. What still surprises me is that you seem able to do something differently with intrinsics while believing this is not possible as a user of Highway. It is indeed possible to call _mm_fixupimm_pd(v1.raw, v2.raw, v3.raw, imm), and the rest of your code can be portable. I would be surprised if heavy usage were made of such escape hatches, but it's certainly interesting to discuss any cases that arise.
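
To illustrate the escape-hatch pattern, here is a small sketch using a hypothetical toy wrapper (not Highway's types; `Vec4f`, `Load`, `Sub`, `NegativeLaneMask` and `LessThanMask` are invented names). Since `_mm_fixupimm_pd` needs AVX-512 hardware, the sketch uses `_mm_movemask_ps` as a stand-in for "an intrinsic called directly on `.raw`". Only that one function is target-specific; the caller stays portable.

```cpp
#include <cmath>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

namespace escape_sketch {

#if defined(__SSE__)
struct Vec4f { __m128 raw; };
inline Vec4f Load(const float* p) { return Vec4f{_mm_loadu_ps(p)}; }
inline Vec4f Sub(Vec4f a, Vec4f b) { return Vec4f{_mm_sub_ps(a.raw, b.raw)}; }

// Escape hatch: reach past the wrapper and call an x86 intrinsic on .raw.
// Returns one bit per lane, set if that lane's sign bit is set.
inline int NegativeLaneMask(Vec4f v) { return _mm_movemask_ps(v.raw); }
#else
struct Vec4f { float raw[4]; };
inline Vec4f Load(const float* p) { return Vec4f{{p[0], p[1], p[2], p[3]}}; }
inline Vec4f Sub(Vec4f a, Vec4f b) {
  for (int i = 0; i < 4; ++i) a.raw[i] -= b.raw[i];
  return a;
}

// Same contract on other targets, implemented with plain C++.
inline int NegativeLaneMask(Vec4f v) {
  int bits = 0;
  for (int i = 0; i < 4; ++i) {
    if (std::signbit(v.raw[i])) bits |= 1 << i;
  }
  return bits;
}
#endif

// Portable caller: bit i is set when a[i] < b[i] (NaN inputs not handled).
inline int LessThanMask(const float* a, const float* b) {
  return NegativeLaneMask(Sub(Load(a), Load(b)));
}

}  // namespace escape_sketch
```

The design question raised in this subthread is how much of a kernel ends up inside functions like `NegativeLaneMask`. If it is one or two, the wrapper still carries most of the code; if it is most of the kernel, the portable layer buys little.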

I do respect your decision, and that you make clear that raw intrinsics have higher upfront and maintenance costs. I suppose it's a matter of preference and estimating the return on the investment of learning the Highway vocabulary (=searching x86_128-inl.h for the intrinsic you know).

Personally, I find the proliferation of ISAs makes a clear case against hand-written kernels. But perhaps in your use case, x86 will continue to be the only target of interest. Fair enough.

Most video encoders and decoders consist of kernels with hand written SIMD instructions/intrinsics.

Agreed. FWIW we demonstrated with JPEG XL (image codec, though also with animation 'video' support) that it is possible to write such kernels using the portable Highway intrinsics.

I would wager that most real world SIMD use is with direct intrinsics.

> I hope people aren't writing directly to AVX2.

Did you not read the article? It's using AVX intrinsics and NEON intrinsics.

I did, and I truly do not understand why some people do this. As shown in the reddit comments on this article [1], the initial intrinsics version was quite suboptimal and clearly worse than portable code [2].

When not busy unnecessarily rewriting everything for each ISA, it is easier to see and have time for vital optimizations such as unrolling :)

[1]: https://www.reddit.com/r/cpp/comments/1gzob1g/understanding_...

[2]: https://github.com/google/highway/blob/master/hwy/contrib/do...
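
For readers wondering why unrolling is singled out above: a dot product with one accumulator forms a single dependency chain, and without flags such as `-ffast-math` compilers are generally not allowed to reorder floating-point additions to break it. A minimal scalar sketch of the idea follows (the function names are made up, and this is not the code from either link). In a SIMD kernel each accumulator would be a vector rather than a `float`.

```cpp
#include <cstddef>

// One accumulator: every addition has to wait for the previous one.
float DotSimple(const float* a, const float* b, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

// Four independent accumulators: the four chains can overlap in the
// pipeline. The summation order differs from DotSimple, so results may
// differ slightly in the last bits for general inputs.
float DotUnrolled4(const float* a, const float* b, size_t n) {
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += a[i + 0] * b[i + 0];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  float sum = (s0 + s1) + (s2 + s3);
  for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
  return sum;
}
```

How many accumulators pay off depends on the latency and throughput of the add or FMA unit on the target CPU, so the factor of four here is only an assumption for illustration.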