by mgaunard 33 days ago
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).

Back then, it was rejected, for the same arguments that people are making today, such as not mapping to SVE well, having a separate way to express control flow etc.

There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.


> There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language.

I think this is the best solution for truly portable SIMD. Sure, it doesn't cover everything, but it makes autovec explicit, guaranteed, and more powerful.

One of the biggest problems with "portable" SIMD libraries is that for simple things autovec is often better, as it has direct access to the ISA semantics and can much more easily do things like unrolling.

To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.

Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.

Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.

> Trying to abstract over SVE with a SIMD library is a bit of a fool's errand

It really isn't. You just make the default SIMD-width agnostic and anything less portable opt-in.

You can still specialize for a specific width on scalable vector ISAs.
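
As a rough illustration of "width-agnostic by default, fixed width as opt-in", here is a minimal sketch in plain C++. It uses no real SIMD library: `scale_add`, `scale_add_fixed`, and the `lanes` parameter are hypothetical names, and the inner scalar loops only stand in for what would be a single vector operation on real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Width-agnostic default: `lanes` is a runtime value (assumed to come from a
// hardware query on a scalable-vector ISA). The loop never hard-codes it.
inline void scale_add(const float* x, const float* y, float* out,
                      std::size_t n, float a, std::size_t lanes) {
    assert(lanes > 0);
    for (std::size_t i = 0; i < n; i += lanes) {
        // `active` plays the role of a predicate that masks off the tail.
        const std::size_t active = std::min(lanes, n - i);
        for (std::size_t l = 0; l < active; ++l) {
            out[i + l] = a * x[i + l] + y[i + l];
        }
    }
}

// Opt-in, less portable variant: the width is a compile-time constant,
// so the compiler can fully unroll the inner loop.
template <std::size_t Lanes>
inline void scale_add_fixed(const float* x, const float* y, float* out,
                            std::size_t n, float a) {
    static_assert(Lanes > 0, "Lanes must be positive");
    std::size_t i = 0;
    for (; i + Lanes <= n; i += Lanes) {
        for (std::size_t l = 0; l < Lanes; ++l) {
            out[i + l] = a * x[i + l] + y[i + l];
        }
    }
    // Scalar remainder loop for the tail.
    for (; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```

Both variants compute the same result; the only difference is whether the width is known when the code is compiled.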

> The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it.

Such as?

> All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.

Google Highway doesn't. And while Arm is stuck with 128-bit SVE, because they also have to implement NEON as fast as possible to be competitive, RVV already has a large diversity of hardware available with different vector lengths: 128, 256, 512, 1024.

> Such as?

I have a database that has big columns that get functions applied to them to compute the result set. This is a perfect case for length agnostic instructions, except it ends up horribly memory bound. A nice optimization is to only compute those lanes containing rows that might actually be in the result set by keeping track of a sparse record that depends on the lane size. But the cnt instructions are optional, and this also inhibits compiler optimizations in that lookup.

CNT and CNTP don't seem to be optional for SVE, from what I found. (unless you mean HISTCNT)

It seems to me like you want to use CNTP on a bitset that tells you which rows are relevant, skipping them if CNT is 0? Is that what you were describing?

I was confused and thinking that streaming mode and CNT were in separate extensions, but they're both in SME. My bad.

Anyway, essentially yes. My previous comment didn't mention all of the context. The join enforces that the result set is the intersection of the individual column sets, so it gets increasingly sparse as individual columns are computed. So I just maintain a bit tree that says which columns could populate the result set and skip computing the other lanes, which depends on the vector width and benefits from knowing it at compile time.
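
A minimal sketch of the skipping idea, under stated assumptions: it uses a flat bitset rather than the bit tree described above, plain scalar C++ rather than SVE intrinsics, and squaring as a placeholder for the real column function. `square_live_rows` and `Lanes` are hypothetical names; `Lanes` stands in for the vector width and is a template parameter so the block mask is a compile-time constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// live: one bit per row; bit r set means row r may still be in the result set.
// Rows are processed in blocks of `Lanes`; a block whose bits are all zero
// is skipped entirely. Returns the number of blocks actually computed.
template <std::size_t Lanes>
std::size_t square_live_rows(const std::vector<std::uint64_t>& live,
                             const std::vector<double>& col,
                             std::vector<double>& out) {
    static_assert(Lanes > 0 && Lanes <= 32 && 64 % Lanes == 0,
                  "Lanes must divide 64 and be at most 32 in this sketch");
    assert(out.size() == col.size());
    assert(live.size() * 64 >= col.size());
    constexpr std::uint64_t block_mask = (std::uint64_t{1} << Lanes) - 1;

    std::size_t computed = 0;
    for (std::size_t base = 0; base < col.size(); base += Lanes) {
        // Extract this block's bits; blocks never straddle a 64-bit word.
        const std::uint64_t bits = (live[base / 64] >> (base % 64)) & block_mask;
        if (bits == 0) {
            continue;  // no row in this block can reach the result set
        }
        ++computed;
        for (std::size_t l = 0; l < Lanes && base + l < col.size(); ++l) {
            if ((bits >> l) & 1) {
                out[base + l] = col[base + l] * col[base + l];
            }
        }
    }
    return computed;
}
```

The number of blocks that can be skipped depends on `Lanes`: the same bitset yields different skip patterns at different widths, which is why the bookkeeping is tied to the vector length.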

I'm no C++ dev, but as an outsider, it sure reads like the whole "int is variable length" mistake again.

In a way it's worse because at least with int you're not really expecting to run the same binary on architectures with different int lengths, and also for several decades there have only been two realistic options (32 or 64), which makes it easy to deal with.

With RVV (and SVE I assume) there is a wider range of realistic options: at least 128, 256 and 512. The RVV spec allows up to 65536! Also, it's totally reasonable to want a single binary to work with all of them, so then you're into compiling parts of your code multiple times with runtime dispatch, which is a right pain.
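
For readers unfamiliar with the "compile multiple times, dispatch at runtime" pattern, here is a minimal sketch in plain C++. The names `sum_kernel`, `SumFn`, and `select_sum` are hypothetical, and `vector_bits` is assumed to come from some hardware query at startup. In a real build each instantiation would be compiled with different target flags; here they differ only in their lane count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One kernel, instantiated once per supported vector width (in bits).
template <std::size_t Bits>
float sum_kernel(const float* data, std::size_t n) {
    constexpr std::size_t lanes = Bits / 32;  // float lanes per vector
    float acc[lanes] = {};
    std::size_t i = 0;
    for (; i + lanes <= n; i += lanes) {
        for (std::size_t l = 0; l < lanes; ++l) {
            acc[l] += data[i + l];
        }
    }
    float total = 0.0f;
    for (std::size_t l = 0; l < lanes; ++l) {
        total += acc[l];
    }
    for (; i < n; ++i) {  // scalar tail
        total += data[i];
    }
    return total;
}

using SumFn = float (*)(const float*, std::size_t);

// Pick the widest instantiation the (assumed) hardware width supports.
inline SumFn select_sum(std::size_t vector_bits) {
    if (vector_bits >= 512) return &sum_kernel<512>;
    if (vector_bits >= 256) return &sum_kernel<256>;
    return &sum_kernel<128>;
}
```

The selection would normally happen once at startup, with the resulting function pointer cached for later calls.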

I haven't looked into how Highway does it but I don't really know how you write length-agnostic code in high level languages. It's easy in assembly, but it sucks if you have to do it in assembly.

Here is a Highway example: https://gcc.godbolt.org/z/7sdPr61W6

There is a bit of boilerplate to get dynamic dispatch working, but apart from that it's quite simple to use.

That's a mistake for ABI visible types, yes.

That abstraction is occasionally usable in low level systems code, which is why Go, Rust, D and C# support it as well.

Also, note that this is C, not C++.

I don't know how SVE works but I thought the point of it was to let implementations pick a larger size than the CPU supports and then get an automatic speedup from better processors with more vector lanes.