| HN Mirror



	by wmu 3733 days ago
	Not sure about "single code path". The differences among SIMD flavors are significant; there are cases where a one-to-one translation is either impossible or impractical. A prime example is AVX2 instructions operating on 128-bit lanes rather than on whole 256-bit registers. And wrappers exist in the C++ ecosystem; C programmers are stuck with intrinsics.

2 comments

exDM69 3732 days ago

> And wrappers exist in the C++ ecosystem; C programmers are stuck with intrinsics.

If you can accept working with GNU extensions that are available in recent-ish GCC and Clang (but not MSVC, not sure about Intel ICC), there are pretty nice vector extensions [0].

With them you get the standard binary operators for arithmetic (+, -, *, / etc.) and shuffling with __builtin_shuffle. These are CPU-independent: the same code compiles neatly to ARM NEON as well as x86 SSE+AVX+FMA. All you need is a typedef with an __attribute__.
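
A minimal sketch of what that typedef looks like in practice, assuming GCC or Clang. The names float4, int4, madd4, reverse4 and hsum4 are made up for illustration. One caveat: __builtin_shuffle is the GCC spelling; Clang offers __builtin_shufflevector instead, so the sketch picks one with the preprocessor.

```c
#include <stdint.h>

/* Four packed floats in one 16-byte vector (GNU extension, GCC and Clang). */
typedef float float4 __attribute__((vector_size(16)));
/* Integer vector of the same shape, used as a shuffle mask with GCC. */
typedef int32_t int4 __attribute__((vector_size(16)));

/* Element-wise a * b + c, written with ordinary operators. */
static inline float4 madd4(float4 a, float4 b, float4 c)
{
    return a * b + c;
}

/* Reverse the order of the four elements. */
static inline float4 reverse4(float4 v)
{
#if defined(__clang__)
    /* Clang spelling: indices are compile-time constants. */
    return __builtin_shufflevector(v, v, 3, 2, 1, 0);
#else
    /* GCC spelling: indices come from an integer vector. */
    const int4 mask = {3, 2, 1, 0};
    return __builtin_shuffle(v, mask);
#endif
}

/* Horizontal sum using element subscripts. */
static inline float hsum4(float4 v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Which instructions come out of this depends on the target flags; the source itself names no instruction set.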

The vector extension functions don't cover the whole instruction sets, but the vector types are compatible with the native __m128 and NEON formats, so you can resort to intrinsics when necessary.
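
As a rough illustration of dropping down to intrinsics, the sketch below casts a GNU vector type to __m128 on x86 to reach _mm_min_ps, which has no operator equivalent. The float4 and min4 names are hypothetical, and the non-SSE branch is only a plain per-element fallback, not a NEON implementation.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Four packed floats; same size and layout as __m128 on x86. */
typedef float float4 __attribute__((vector_size(16)));

/* Element-wise minimum of two vectors. */
static inline float4 min4(float4 a, float4 b)
{
#if defined(__SSE__)
    /* Same-size vector casts only reinterpret the bits, no conversion. */
    return (float4)_mm_min_ps((__m128)a, (__m128)b);
#else
    /* Portable fallback: compare element by element. */
    float4 r = {0};
    for (int i = 0; i < 4; i++)
        r[i] = a[i] < b[i] ? a[i] : b[i];
    return r;
#endif
}
```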

However, for a lot of SIMD tasks I encounter, just basic arithmetic + shuffles is more than 80% of what I need.

If you want to see some examples, take a look at my collection of 3D graphics and physics related SIMD routines [1]. (Note: this project could use some help. Let me know if you're interested in doing something with it or in porting some of the hand-optimized routines to more widely used math libs like glm.)

[0] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html#Ve...
[1] https://github.com/rikusalminen/threedee-simd


wmu 3732 days ago

> If you can accept working with GNU extensions that are available in recent-ish GCC and Clang

I do my private projects in C++, so that's not an issue for me, but at my current company we also use MSVC. I wish we could abandon that compiler and work with GCC or Clang only.

> However, for a lot of SIMD tasks I encounter, just basic arithmetic + shuffles is more than 80% of what I need.

Your remaining 20% is my 80%. :)


exDM69 3732 days ago

> ... but at my current company we also use MSVC. I wish we could abandon that compiler and work with GCC or Clang only.

Good news! These days you can produce MSVC-compatible binaries with Clang, or even use Clang as the compiler from within the C++ IDE.

Whether or not you can do this in practice is another matter, but it can be done.

> Your remaining 20% is my 80%. :)

Yeah, if you look at my examples, they're rather straightforward arithmetic with 4-dimensional vectors. There's very little need for any integer arithmetic or more exotic combinations of operations. A little fused multiply-add here and there.

But I haven't seen a better method for this: most of the code is CPU-agnostic and will compile to x86 or ARM code using all the available instruction sets (depending on compiler arguments, e.g. -mavx2 or -march=native). I really haven't seen a SIMD math lib with so little duplication for different CPUs elsewhere.


johnt15 3733 days ago

The property of AVX and AVX2 you mentioned actually helps with having a single code path. If the SIMD wrapper allows parameterization on vector width (most do), you can simply increase the vector width when compiling for AVX, and that's it.
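
One way to picture this, sketched with GNU vector extensions rather than any particular wrapper library: the vector width is a compile-time constant and the loop body never mentions it. VEC_BYTES, VEC_LANES, vfloat and saxpy are illustrative names, and keying the width on __AVX__ is an assumption about how the build is configured.

```c
#include <stddef.h>
#include <string.h>

/* Pick the vector width once; everything below is width-agnostic. */
#if defined(__AVX__)
#define VEC_BYTES 32
#else
#define VEC_BYTES 16
#endif
#define VEC_LANES (VEC_BYTES / sizeof(float))

typedef float vfloat __attribute__((vector_size(VEC_BYTES)));

/* y[i] = a * x[i] + y[i] for i in [0, n). */
static void saxpy(float a, const float *x, float *y, size_t n)
{
    /* Broadcast the scalar into every lane. */
    vfloat va = {0};
    for (size_t k = 0; k < VEC_LANES; k++)
        va[k] = a;

    size_t i = 0;
    for (; i + VEC_LANES <= n; i += VEC_LANES) {
        vfloat vx, vy;
        memcpy(&vx, x + i, sizeof vx);   /* unaligned-safe load */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;
        memcpy(y + i, &vy, sizeof vy);   /* unaligned-safe store */
    }
    /* Scalar tail for the leftover elements. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

This works because the arithmetic is purely element-wise; nothing in the loop depends on how elements are arranged within the register.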


wmu 3733 days ago

I understand your point; however, it is not as simple as it seems. Of course, for trivial code the transition between different SIMD flavors could be seamless. But the world is cruel. :)

Think about shuffling instructions (pshufb): the lookup vectors for the instruction are different in AVX2 and SSE. Even if an AVX2 vector could be created by cloning an SSE vector twice, this must be the programmer's decision.
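
The lane issue can be made concrete with a scalar model of the documented behaviour of the two instructions (plain C, no intrinsics, so it runs anywhere). The function names are invented for this sketch. Each 128-bit lane of the AVX2 form looks only at the low 4 bits of an index and reads only from its own lane.

```c
#include <stdint.h>

/* Scalar model of SSSE3 pshufb (_mm_shuffle_epi8) on one 16-byte vector:
 * a set high bit in the index zeroes the byte, otherwise the low 4 bits
 * select a source byte. */
static void pshufb128(uint8_t dst[16], const uint8_t src[16],
                      const uint8_t idx[16])
{
    uint8_t tmp[16];   /* temporary so dst may alias src */
    for (int i = 0; i < 16; i++)
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Scalar model of AVX2 vpshufb (_mm256_shuffle_epi8): the same operation
 * applied to the low and high 16-byte lanes independently. */
static void vpshufb256(uint8_t dst[32], const uint8_t src[32],
                       const uint8_t idx[32])
{
    pshufb128(dst, src, idx);
    pshufb128(dst + 16, src + 16, idx + 16);
}
```

With src[i] = i and idx[i] = 31 - i, the 256-bit model does not reverse the whole vector: each 16-byte half is reversed in place. A 16-entry lookup table, by contrast, works unchanged once it is copied into both halves, which is the "cloning" case mentioned above.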

Another example is an algorithm that uses the video-encoding instruction mpsadbw to locate substrings (http://0x80.pl/articles/sse4_substring_locate.html#introduct...). The AVX2 instruction vmpsadbw operates on 128-bit lanes, and parts of the algorithm have to be rewritten to accommodate this limitation.
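
For readers who have not met mpsadbw, here is a scalar model of its documented behaviour, plus the AVX2 form built from two independent lanes. This is not the linked article's code; the function names are invented, and the substring use described in the note afterwards is only one reading of why the instruction is useful there.

```c
#include <stdint.h>

/* Scalar model of SSE4.1 mpsadbw (_mm_mpsadbw_epu8).
 * imm8 bit 2 selects a starting offset of 0 or 4 bytes in a;
 * imm8 bits 1:0 select one of four 4-byte blocks in b.
 * dst[j] is the sum of absolute differences between the 4 bytes of a
 * starting at (offset + j) and the selected block of b. */
static void mpsadbw128(uint16_t dst[8], const uint8_t a[16],
                       const uint8_t b[16], int imm8)
{
    const int a_off = ((imm8 >> 2) & 1) * 4;
    const int b_off = (imm8 & 3) * 4;
    for (int j = 0; j < 8; j++) {
        int sum = 0;
        for (int t = 0; t < 4; t++) {
            int d = a[a_off + j + t] - b[b_off + t];
            sum += d < 0 ? -d : d;
        }
        dst[j] = (uint16_t)sum;
    }
}

/* Scalar model of AVX2 vmpsadbw (_mm256_mpsadbw_epu8): each 16-byte lane
 * is processed on its own; imm8 bits 2:0 control the low lane and
 * bits 5:3 control the high lane. */
static void vmpsadbw256(uint16_t dst[16], const uint8_t a[32],
                        const uint8_t b[32], int imm8)
{
    mpsadbw128(dst, a, b, imm8 & 7);
    mpsadbw128(dst + 8, a + 16, b + 16, (imm8 >> 3) & 7);
}
```

A zero in dst[j] means the four bytes of a at that position equal the chosen 4-byte block of b, which is what makes the instruction usable as a substring prefilter. In the 256-bit model no window can straddle byte 16, and the block of b must be present in both lanes, so code that steps through the input has to account for the lane boundary.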
