by mtklein
119 days ago
If I remember correctly, the AVX2 feature set is a fairly direct widening of SSE4.1 to 256 bits. Very few instructions even allow interaction between the top and bottom 128 bits, I assume to make implementation on existing 128-bit vector units easier. And the most notable new things that AVX2 added beyond that widening, fp16 conversion and FMA support, are also present in NEON, so I wouldn't expect those to be the issue either. So I'd bet the issue is either the newness of the codebase, as the article suggests, or perhaps that it is harder to schedule the work in 256-bit chunks than in 128-bit ones. It's got to be easier when you've got more than enough NEON q registers to handle the xmms, and harder when you've got only exactly enough to pair up for handling ymms?
|
That would be plain AVX; AVX2 has shuffles across the 128-bit boundary. To me that seems like the main hurdle for emulation with 128-bit vectors. In my experience compilers are very eager to emit shuffle instructions if allowed, and emulating a 256-bit shuffle with 128-bit operations would require 2 shuffles and a blend for each half of the emulated register.
EDIT: I just noticed that the benchmark in the article is pure math, which probably wouldn't hit this particular issue, so this doesn't explain the performance difference...
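To make the "2 shuffles and a blend for each half" cost concrete, here is a rough scalar model in portable C. It is only a sketch: `v128`, `v256`, `shuffle128`, `blend128` and `permute256` are hypothetical names invented for illustration, the permute is loosely modeled on a `vpermd`-style 8x32-bit variable shuffle, and nothing here claims to be how any real emulator lowers it.

```c
#include <stdint.h>

/* Hypothetical 128-bit vector: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v128;

/* Hypothetical 256-bit register, emulated as two 128-bit halves. */
typedef struct { v128 lo, hi; } v256;

/* 128-bit variable shuffle: out[i] = src[idx[i] & 3]. */
static v128 shuffle128(v128 src, const uint32_t idx[4]) {
    v128 out;
    for (int i = 0; i < 4; i++) out.lane[i] = src.lane[idx[i] & 3];
    return out;
}

/* 128-bit blend: bit 2 of each index says the element came from the
 * high source half, so pick from_hi there and from_lo otherwise. */
static v128 blend128(v128 from_lo, v128 from_hi, const uint32_t idx[4]) {
    v128 out;
    for (int i = 0; i < 4; i++)
        out.lane[i] = (idx[i] & 4) ? from_hi.lane[i] : from_lo.lane[i];
    return out;
}

/* Emulated cross-lane permute: out[i] = src[idx[i] & 7].
 * Each output half costs two 128-bit shuffles plus one blend. */
static v256 permute256(v256 src, const uint32_t idx[8]) {
    v256 out;
    out.lo = blend128(shuffle128(src.lo, idx),
                      shuffle128(src.hi, idx), idx);
    out.hi = blend128(shuffle128(src.lo, idx + 4),
                      shuffle128(src.hi, idx + 4), idx + 4);
    return out;
}
```

Calling `permute256` with the index vector `{7,6,5,4,3,2,1,0}` reverses all eight lanes, which forces every output element to cross the 128-bit boundary. That is four shuffles and two blends for what is a single instruction on the native side.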