Hacker News new | ask | show | jobs
by dragontamer 1484 days ago
Tldr: autovectorizer transforms the first code into SIMD instructions, but the second into 64 bit instructions.

SIMD is very powerful, and modern compilers can sometimes simd-ify your code automatically.

-------

The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorizer it in that form. I kinda want to give it a go myself but don't have time today...

Autovectorizers are mysterious, often harder to see and use than explicitly SIMD code (like OpenCL or CUDA).

7 comments

The other essential insight is that the second version has data dependencies between loop iterations. Without that, the tldr is incomplete.
That is the 'why' autovectorization fails. But it seems solvable in this case pretty easily actually.

EDIT: Solvable by a human. I dunno much about compilers though, so I dunno if there's some kind of language issue that would prevent the unrolling of this dependency.

I think the point GP is making is that even without vectorization, the data dependency causes stalls even in normal, single data instructions. That is, a data dependency between iterations of loops will hurt performance even for non-vectorizable calculations (or on CPUs with high ILP but no really good vector instructions, which granted is probably a small pool since both those things came about at about the same time).
I think that is the point Peter Cordes is trying to make (quite politely but firmly) over there on stackoverflow. It's not only about autovectorization. The main point there is the loop-carried dependency that prevents both the compiler (autovec) and the processor (ILP) to do their thing.

Loop-carried dependency is the big culprit here. I wish we had a culture of writing for loops with the index a constant inside the loop, as in the Ada for statement, and not the clever C while loops or for with two running variables... Simpler loop syntax makes so many static analyses 'easier' and kind of forces the brain to think in bounded independant steps, or to reach for higher level constructs (e.g. reduce).

I wonder if it becomes a math problem to optimize then, like Euler solving the sum of 1-100 by adding 1 to 100 and multiplying by the 50 pairs of numbers that operation created to get 5050?
It's a different issue altogether. Even in vectorized code artificial data dependencies are an issue. E.g., for fast dot products (not actually usually applicable on most input sizes because of memory bandwidth, but related problems can be CPU bound) you'll roughly double your execution speed by alternating between two running totals and merging those at the end, and that gain is mostly orthogonal to the question of whether those running totals are vectorized.

Edit: And many modern compilers have been doing that particular optimization for a few years anyway, but it's still an important idea to keep in mind for any non-trivial graph of operations.

> The second code can probably become SIMD as well, but it's beyond GCC's ability to autovectorizer it in that form.

Usually, for floating point operations the compiler simply has no chance to do anything clever. NaNs, infinities and signed zero mean that even the most "obvious" identities don't actually hold. For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.

> For example, x + 0 == x (where the == is bitwise) does not hold for x = -0.

But when operating on floating point numbers, == is not bitwise, and -0 == +0.

If the user provides code where the result should be +0.0 and the compiler emits code that results in -0.0 (and the user has not explicitly enabled relaxed IEEE754 conformance), that's a bug in the compiler.
That is correct. But we were talking about how the == operator compares zeros. According to IEEE 754, -0 == +0 even though their bit patterns are not identical. And, by the way, inf != inf even if their bit patterns are identical.
> But we were talking about how the == operator compares zeros

Well, you were and I wasn't, and it wasn't clear whether you meant to contradict my point. I guess I could've made it more clear that I meant "using == to denote bitwise equality here only" and not "btw, == is bitwise for floats", but either way it should be clear now.

You mean nan right?
Yes. I mean nan != nan. I don’t know why I wrote inf. Thank you for correcting.
That doesn’t matter for the case where a compiler would like to replace x + y by x because it knows y == 0.

The compiler would have to test for the x == -0 case, and doing that ¿rarely/never? is faster than computing x + y.

TurboFan does reason about whether a value can be -0 and will strength reduce if it's known a -0 will not occur. Further, it reasons about observations of a value; if it is known that a -0 is never observed, e.g. the value is only ever converted to an integer, or compared, or otherwise only flows into expressions that treat 0 == -0, then it will do more aggressive optimizations.
There is a compiler switch for that too: -fassociative-math -funsafe-math-optimizations
I like how the flag combines in that second one to create "fun safe", pretty much flipping its meaning.
so x+0 is 0?
No. -0.0 + 0 is not necessarily bitwise-equal to 0.0. Though it arguably could be, depending on details.
-0.0 + x = x, even if x should be -0.0 or +0.0.

+0.0 + x, on the other hand, is not always equal to x.

Note that IEEE 754 defines exactly which zero you get in every combination of operands. -0 + 0 is actually 0.
How can one go about making one's code apt for a compiler to be able to do these kinds of things?
A good way is to have your data arranged in structs of arrays rather than in arrays of structs. This allows the compiler to generate code which just loads in linear sections of memory to SIMD registers. It’s also just more cache efficient in general.

Check out data oriented design if you aren’t already familiar.

Note that sometimes it's sufficient to have arrays of structs of arrays. That is you don't need to necessarily go whole hog.
That very much depends on access patterns. If you’re performing an operation I’ve ever object with a certain field, struct of array makes sense. If you’re doing an operation which uses many fields on some arbitrary dynamic randomly ordered subset of objects, then array of structs will yield better because at least you recover some memory locality.
You study autovectorizers, then you enable autovectorization warning flags, and carefully read your compilers output.

If the compiler says autovectorization failed, you rewrite the code until the autovectorizer works.

https://docs.microsoft.com/en-us/cpp/build/reference/qvec-re...

hm, I have had limited success with such nudging - it's certainly not a programming model with a predictable outcome. And as soon as the compiler or its flags are updated, maybe the situation changes.
I also wonder if there's any compiler that allows you to hint that you expect certain optinizations to occur (like vectorization), and if they do not, fails to compile at all.
Don’t bother. Autovectorization almost fundamentally doesn’t work.

You can write code using explicit vectors and it’s not too bad, though less cross platform than you’d like.

An actual answer involves things like restrict and alignment keywords.

Explicit vectors can be made portable (to SSE4/AVX2/AVX-512/NEON/SVE/SVE2/RISC-V V/WASM SIMD) using the Highway library which gives you wrapper functions over the platform's intrinsics, emulating missing functionality where required: github.com/google/highway. Disclosure: I am the main author.
The problem is not the ability of the compiler, is that it's not numerically equivalent, so the compiler isn't allowed to do that optimization.
How would the second approach be vectorized given each iteration's input has dependence on previous iteration's output??
Unroll the dependency until you are longer than the SIMD width.

Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Or in other words: i+7 can depend on i-1 no problems.

> Unroll the dependency until you are longer than the SIMD width.

> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

    void compute()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A128 = 128*A;
        double Y[8], Z[8];
    
        Y[0] =               C;
        Y[1] =     A +   B + C;
        Y[2] =   4*A + 2*B + C;
        Y[3] =   9*A + 3*B + C;
        Y[4] =  16*A + 4*B + C;
        Y[5] =  25*A + 5*B + C;
        Y[6] =  36*A + 6*B + C;
        Y[7] =  49*A + 7*B + C;
        Z[0] =  64*A + 8*B;
        Z[1] =  80*A + 8*B;
        Z[2] =  96*A + 8*B;
        Z[3] = 112*A + 8*B;
        Z[4] = 128*A + 8*B;
        Z[5] = 144*A + 8*B;
        Z[6] = 160*A + 8*B;
        Z[7] = 176*A + 8*B;
    
        int i;
        for(i=0; i<LEN; i+=8) {
            data[i  ] = Y[0];
            data[i+1] = Y[1];
            data[i+2] = Y[2];
            data[i+3] = Y[3];
            data[i+4] = Y[4];
            data[i+5] = Y[5];
            data[i+6] = Y[6];
            data[i+7] = Y[7];
            Y[0] += Z[0];
            Y[1] += Z[1];
            Y[2] += Z[2];
            Y[3] += Z[3];
            Y[4] += Z[4];
            Y[5] += Z[5];
            Y[6] += Z[6];
            Y[7] += Z[7];
            Z[0] += A128;
            Z[1] += A128;
            Z[2] += A128;
            Z[3] += A128;
            Z[4] += A128;
            Z[5] += A128;
            Z[6] += A128;
            Z[7] += A128;
        }
    }
> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.

I'm always "surprised" by the fact that CPUs care more about bandwidth rather than latency these days. A lot of CPUs (Intel, AMD, ARM, etc. etc.) support 1x or even 2x SIMD-multiplications per clock tick, even though they take 5 clock ticks to execute.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput wise), even though its a big-deal latency wise.

So getting rid of those multiplies and cutting down the latency (ie: using only add statements) barely helps at all, maybe with no measurable difference.

One of these days, I'll actually remember that fact, lol.

On my machine, your code is faster for smaller LEN values. I'm not sure why this is though.
8x 64-bit is 512-bit, which is designed for AVX512. You'll probably need AVX512 to fully benefit from unrolling x8.

4x 64-bit is 256-bit, which requires special compiler flags for 256-bit AVX2, but most x86 CPUs should support them these days.

2x64-bit is 128-bit, which fits in default SSE 128-bit SIMD with default GCC / Visual Studio compiler flags.

If they were integer variables, I guess the compiler would have done that, but you can't really do that with floats because i+A+A is not necessarily i+2*A. (Of course, in this particular example, the difference doesn't matter for the programmer, but the compiler doesn't know that!)

I think there's some gcc option that enables these "dangerous" optimizations. -ffast-math, or something like that?

No the computer would have been unlikely to be able to figure out the math to coalesce 8 recursive additions into one operation.
> as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other

I really don't see how that works in improving this.

You can only calculate i+8, for calculating i+9 you depend on 8. And you can't go in strides either since i+16 depends on i+15 which you've not calculated so far unless you want to intermix the stateful and non-stateful code. I'd rather not go there.

both versions use SSE and are pipelined, the problem with the second one is data dependency, only two adds but the second one directly depends on the first ones result = stall
SSE includes "scalar" adds (addsd), which are a 1x floating point instruction. These are "non-SIMD" instructions, serving as a replacement for the legacy x87 instructions.

There is also "parallel" adds (addpd).

Carefully look at the assembly language, the 1st version uses parallel adds (addpd) and parallel multiplies. The 2nd version uses scalar adds (addsd)

The other major point is that the 2nd version uses a singular move qword (64-bit) per loop iteration, while the 1st version is using the full 128-bit move per loop iteration.

---------

SSE is used for scalar double-precision these days, because scalar-SSE is faster than x87 instructions... and better matches the standards (x87 had "higher precision" than the IEEE specs, so it has different results compared to other computers. SSE is closer to the specs)

> Tldr: autovectorizer transforms the first code into SIMD instructions, but the second into 64 bit instructions.

That's not the tldr. It's actually more wrong than right.

The actual TLDR is: loop dependencies prohibit the compiler _and_ your CPU from parallelizing. The SIMD part is the compiler, the more important part here is the CPU though.

Long explanation:

Let's start with the compiler. Yes, the first version can be vectorized and you can see that in the included screenshots of the assembly output. The use of addpd [0] (add 2 pairs of 64bit doubles in simultaneously) instead of addsd (add 1 pair of 64bit doubles). Both operations have identical throughput/latency on any architecture not too ancient (e.g. check information of the corresponding intrinsic here [3]. Unfortunately Agner instruction_tables doesn't include the ADDSD for most architectures.) So while you can crunch 2 operations using SIMD at the same time, you are still doing 3 multiplications and 2 additions instead of 2 additions. The multiplications and addition have the same throughput at least on Skylake. So we can use simple math and 5/2 is still more than 2!

Now to the CPU part. The loop dependency prohibits the compiler to properly utilize instruction pipelining [4] and now differences in throughput vs latency come into play. In a pipelined scenario the CPU can work on multiple steps of the instruction cycle in parallel (from fetching the instruction, then decoding it, to executing the operation and eventually writing the result back to the register) and thus achieve higher throughput. However, if your next instruction depends on the instruction before the CPU has to finish all the steps of a pipeline from instruction start to finish before the next instruction can start. This is the latency of an instruction. For example mulsd/mulpd have 8 times higher latency than throughput on Intel Skylake.

So while SIMD comes in at a factor of 2, the loop dependency stops the CPU from realizing a potential factor of 8.

Peter Cordes also correctly points this out in his comments and answer to the question on stackoverflow.

[1] https://www.felixcloutier.com/x86/addpd

[2] https://www.felixcloutier.com/x86/addsd

[3] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[4] https://en.wikipedia.org/wiki/Instruction_pipelining

In short, the CPU can execute up to 6 or 8 instructions at once, but only if there are no dependencies between instructions. In the second version every operation depends on the previous result.