| HN Mirror

Right, though your use of the terms "different registers" and "properly pipelined" is quite nuanced for a non-specialist. It leaves a lot to the imagination or prior experience!

If it is a compute-heavy code that was spilling to a stack and now can fit in the SIMD registers, there could be a speed up with less memory traffic. This is analogous to a program's working set either fitting in RAM or spilling to swap while it runs, but at a different level of the memory hierarchy. In the extreme case, your working set could be a fixed set of variables in expensive loops for some sequential algorithm, and so the compiler register placement ability and number of registers available could be the difference between effectively O(1) or O(n) memory accesses with respect to the number of compute steps.

Of course, you can find such transitions between registers/L1 cache, L1/L2 cache, L2/L3 cache (if present), local/remote RAM (for modern NUMA machines), etc. This happens as you transition from a pure CPU-bound case to something with bigger data structures, where the program focus has to move to different parts of the data to make progress. Naively holding other things equal, an acceleration of a CPU-bound code will of course mean you churn through these different data faster, which means more cache spills as you pull in new data and have to displace something else. Going back to the earlier poster's question, this cache spilling can have a detrimental effect on other processes sharing the same cache or NUMA node, just like one extremely bloated program can cause a virtual memory swapping frenzy and hurt every other program on the machine.

One form of "pipelining" optimization is to recognize when your loop is repeatedly fetching a lot of the same input data in multiple iterations and changing that to use variables/buffers to retain state between iterations and avoid redundant access. This often happens in convolution algorithms on arrays of multi-dimensional data, e.g. image processing and neural nets and many cellular-automata style or grid-based simulations. Your algorithm slides a "sampling window" along an array to compute an output value as a linear combination of all the neighboring values within the window using some filtering kernel (a fixed pattern of multiplicative scalars/weights). Naive code keeps loading this whole window once for each iteration of the loop to address one output position. It is effectively stateless except for the loops to visit every output. Pipelined code would manage a window buffer and fetch and replace only the fringe of the buffer while holding and reusing inputs from the previous iterations. While this makes the loops more stateful, it can dramatically reduce the number of memory reads and shift a memory-bound code back into a CPU-bound regime.

Other major optimizations for these N-dimensional convolutions are done by the programmer/designer: some convolution kernels are "decomposable" and the expensive multi-dimensional operation can be strength-reduced to a sequence of 1-dimensional convolutions with appropriately chosen filter kernels to produce the same mathematical result. This optimization has nothing to do with SIMD but simplifies the work to embody the operation. Fewer input values to read (e.g. a 1D line of neighbors instead of 2D box) and the number of arithmetic operations to do all the multiply-and-add steps to produce the output. Imagine a 2D filter that operates on an 8x8 window. The 2D convolution has to sample and combine 64 input neighbors per output, while the decomposed filter does 8 neighbors in each axis and so one quarter as many steps total after one pass for the X axis and one pass for the Y axis.

Naively, decompsition is done as separate 1D passes over the data and so performs more memory writes and allocates an additional intermediate array between the original input and final output. This is often a big win in spite of the extra memory demands. It's a lot more coding, but this could also be "pipelined" to some degree in N-dimensions to avoid pushing the entire intermediate results arrays through main memory. Approaches differ, but you can make different tradeoffs for how many intermediate results to store or buffer versus redundant calculation in a less stateful manner.

Generally, as you make the above kinds of optimizations your code becomes more "block-based", loading larger chunks of data with fewer changes to new "random" memory locations. This is very much like how databases and filesystems optimized their access to disk storage in prior decades to avoid random seeks for individual bytes or blocks and to instead use clever structures to support faster loading of sets of blocks or pages. When this is successful, your program does most of its memory access sequentially and achieves higher throughput that is bound by total memory bus bandwidth rather than random access latency limitations.

The final form of "pipelining" matters once you really hit your stride with these optimized, block-based programs. All this access again can cause cache spilling as you sustain high bandwidth memory access. But if you can provide the right hints to the CPU, or if the CPU is clever enough to detect the pattern, it can shift into a different cache management regime. Knowing that you will only access each block of data once and then move on (because you are holding the reusable state in registers), the CPU can use a different prefetching and spilling policy to both optimize for your next memory access and to avoid churning the whole cache with these values you will only read once and then ignore. This reduces the impact of the code on other concurrent tasks which can still enjoy more reasonable caching behavior in spite of the busy neighbor.