Hacker News new | ask | show | jobs
by sampo 1480 days ago
In the post, multiplications and 2 additions are not faster than 2 additions.

The post compares (1) loop code that can be vectorized, as loop rounds are independent and do not depend on the result from the previous round, and (2) an "optimization" that makes calculations shorter, but also makes each loop round depend on the result of the previous round, so this cannot be vectorized.

7 comments

More generally, “stop storing state, unless it makes the program less complicated.”

The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.

(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)

Complexity isn’t always obvious. It’s sometimes common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of it in terms of geometry (memcpy = copying rectangles onto other larger rectangles) whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.

We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)

MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.

> stop storing state, unless it makes the program less complicated

I had hoped that functional programming languages would lead us to automatic high-level parallelization. That hasn't happened much in the real world. But what we got is "write code like a functional programmer would even in a low-level language and the compiler and the ISA will parallelize it for you." That surprises me, but I'll take it.

Yeah I mean any functional operation on a list (I'm thinking of JS mainly) should be parallelizable, assuming the interpreter can determine that the order of execution does not matter.

I recall that Scala was the first language that introduced FP to me, and they made the developer explicitly tell that a functional operation could be done in parallel, after which the compiler and runtime would take care of the rest.

This was implemented in the extreme in a project that allowed you to write a functional style series of operations, which were translated to Hadoop map and reduce jobs, each of which could be executed in parallel at large scale. So very compact and succinct code, but very powerful.

It's because under the hood modern processors are embarrassingly parallel, so chip manufacturers like Intel were incentivized to put money into solving the hard problem "turn code written by devs who were trained on old serial-style languages (like C) into something that can pass a benchmark or we can't justify selling new chips because programs won't be faster."

The relative novelty and paucity of functional languages in the ecosystem means the incentive story hasn't happened to get the end-to-end from compiler through to execution so tight yet. I expect it'll happen, and when it does I expect the result will trigger fewer questions like this one because a newer generation of programmers well be less inclined to assume serial execution.

I think it's that it's easier to detect a lack of data dependencies in C than it is to write an efficient Haskell compiler. Well, that and the fact that there is at least an order of magnitude more people working on the former than the latter
memcpy is more like copying a single line, to copy a rectangle you need to call it in a loop or use specialized blit hw/sw.

How is MLIR different from LLVM IR?

And tile/block processing is not that novel for generic compute, graphics use it for other reasons - mostly how the pixel data is in memory and how the pipeline scales with tile size.

LLVM is geared towards providing a single toolchain for frontends to target, but is essentially made to target von Neumann/serial-ish/homogenous machines. You can get an overview of its philosophy in Lattner’s original 2004 paper [1]. MLIR (Lattner 2021 [2]) aims to generalize the compiler toolchain to support different compute paradigms and heterogeneous targets (polyhedral compilation/parallel kernels/afine loops) and so on, by defining a declarative framework of IR _dialects_ and transforms between them. I’m writing my MSc thesis on using C/C++ as the IR, assuming it has enough compiler support for different targets to provide a suitable platform for optimisation transforms using a source-to-source compiler, but MLIR seems like a better first-principles approach, maybe suffering from the impracticality of using it _right now_.

[1] "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation", Lattner and Adve. https://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf

[2] “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation”, Lattner et al. https://ieeexplore.ieee.org/document/9370308

> In the post, multiplications and 2 additions are not faster than 2 additions.

Reminded me of when I tried to optimize some code using assembly on x86, maybe 8-10 years go. The compiler spit out a DIV [EDI] instruction which was inside a hot loop. I saw how I could easily rewrite it to use a register instead of the indirect memory access.

So I did, and it was significantly slower, like 10-15%... I even took care to align the loop target address and so on. If I changed my code to use the memory indirection it was measurably faster.

And with that I decided hand-optimizing x86 assembly was definitely a skill to leave behind.

It would be interesting to test it on 486, Pentium, and Pentium Pro, as you get much different execution and caching behavior between those architectures.
You probably added an extra instruction or two to put the value in a register? The CPU can already split memory accesses into uops and cached accesses are fast, so there's no point in doing that because it'll just waste an additional register (vs. using one of the many the renamer generates) and add instructions to decode.

x86 is fundamentally a CISC; if you treat it like a RISC, it will definitely disappoint.

Not really, the compiler did the round-trip into the local variable (memory), I did it via register.

I asked around at the time and someone mentioned that I might have overtaxed certain execution ports or something like that, but yeah that just cemented my belief that x86 optimization is not my cup of tea anymore. Better to spend time learning how to write code the compiler can optimize well.

It's hard to know exactly without staring at the code and knowing the exact CPU, because nothing stands out as a red flag. If you used an extra register, maybe you caused a spill (pushed something else out of registers into memory). Maybe you made the loop code bigger and it no longer fit in the loop stream buffer (if that model had one). Maybe you hit a weird frontend decode issue and it could only decode N-1 instructions per clock in that line instead of N, and that was critical to the loop's performance. Maybe your code layout changed for another reason, or the memory layout, and you got some bad cache aliasing.

These things are knowable if you have enough curiosity and maybe masochism :)

What made me give up was when I found that an optimization for one x86 processor was an impairment on another. Your DIV example might have had a different outcome on an AMD chip, or on the next- or previous-generation part from Intel.

Nowadays, counting cycles is a game for monks who don't actually have to get anything done.

The compiler doesn’t know anything about optimizing x86 code either. The actual details there are too secret for Intel to want to accurately describe them in gcc, they’re different across different CPUs, and compilers just aren’t as good as you think they are.

(But CPUs are usually better than you think they are.)

It's not so secret, TBH. Usually the intel microarchitecture manuals are detailed enough to describe how many and what type of execution ports there are, how many stages in the pipeline, the size of the reorder buffer, latency of most u-ops, and any frontend hazards. The super secret stuff are things like the design of the branch predictors, memory disambiguation, etc, as well as the low-level tricks to optimize each of these down to the fewest gate delays (for high clockspeeds, etc), as well as where and how they figure out placement, etc.
The front end (the decoder stage and branch predictors) are what would theoretically be important for compilers as they’re the bottleneck. But Intel’s optimization advice doesn’t say much about branches anymore, they pretty much want you to rely on them to take care of it.

That’s only part secrecy and part to give them freedom to change it. It is of course somewhat described in their patents.

Would it not be in Intel's best interest to have popular compilers be able to squeeze the most performance out of its own line of CPUs?

I'm wondering how the incentives play out to keep this stuff private?

Intel sells a compiler. I've only used it briefly a long time ago, but its code generation was well ahead of MSVC at the time even for scalar (non-SIMD) stuff, and I remember GCC was far behind too (it would generate roughly the same performance in microbenchmarks, but far more bloated.)
Cached accesses are not fast anymore. Modern chips can retire a dozen or even more instructions in the time it takes to move a word in from L1 cache.
The other issue is instruction-level parallelism, as another poster in TFA pointed out. Even within a single loop iteration the "unoptimized" code is more likely to exploit multiple ALUs if they exist, regardless of vectorization instructions.
Ok, we've reverted the title to what the article says.

(Submitted title was "Multiplications and 2 additions are faster than 2 additions")

To be precise, they're both "vectorized" in the sense that both versions are using SSE vector instructions (and in fact, Clang will even generate AVX instructions for the first version if you use -mavx2). The difference is really the data dependency which has a massive effect on the ability of the CPU to pipeline the operation.

For the first version w/ AVX I get:

$ perf stat ./a.out [-] Took: 225634 ns.

Performance counter stats for './a.out':

            247.34 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             2,009      page-faults:u             #    8.122 K/sec                  
       960,495,151      cycles:u                  #    3.883 GHz                    
     2,125,347,630      instructions:u            #    2.21  insn per cycle         
        62,572,806      branches:u                #  252.982 M/sec                  
             3,072      branch-misses:u           #    0.00% of all branches        
     4,764,794,900      slots:u                   #   19.264 G/sec                  
     2,298,312,834      topdown-retiring:u        #     48.2% retiring              
        37,370,940      topdown-bad-spec:u        #      0.8% bad speculation       
       186,854,701      topdown-fe-bound:u        #      3.9% frontend bound        
     2,242,256,423      topdown-be-bound:u        #     47.1% backend bound         

       0.247734256 seconds time elapsed

       0.241338000 seconds user
       0.004943000 seconds sys

For the second version with SSE and the data dependency I get:

$ perf stat ./a.out [-] Took: 955104 ns.

Performance counter stats for './a.out':

            975.30 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             2,010      page-faults:u             #    2.061 K/sec                  
     4,031,519,362      cycles:u                  #    4.134 GHz                    
     3,400,341,362      instructions:u            #    0.84  insn per cycle         
       200,073,542      branches:u                #  205.140 M/sec                  
             3,192      branch-misses:u           #    0.00% of all branches        
    20,091,613,665      slots:u                   #   20.600 G/sec                  
     3,110,995,283      topdown-retiring:u        #     15.5% retiring              
       236,371,925      topdown-bad-spec:u        #      1.2% bad speculation       
       236,371,925      topdown-fe-bound:u        #      1.2% frontend bound        
    16,546,034,782      topdown-be-bound:u        #     82.2% backend bound         

       0.975762759 seconds time elapsed

       0.967603000 seconds user
       0.004937000 seconds sys
As you can see the first version gets nearly 3x better IPC (2.21 vs 0.84) and spends half as much time being backend bound.
> To be precise, they're both "vectorized" in the sense that both versions are using SSE vector instructions

No, just because it's using SSE doesn't mean it's vectorized. SSE has both scalar and vector instructions.

The first comment on the code does say precisely this.
SIMD vectorization is like quantum mechanics, it becomes classical if you look.