Hacker News new | ask | show | jobs
by cbetti 1481 days ago
Not having played with SIMD much myself, does leveraging these instructions for an intensive operation like a sort push other workloads out of the CPU more aggressively than operating on 32 or 64 bits at a time would?

In other words, do you have to be more careful when integrating these wide operators to preserve some resources for other operations?

4 comments

At least on Intel, AVX has its own functional units and register file, so I would say it's not a major concern. It's possible that running some AVX instructions could be almost or completely free if you weren't using execution port 5 anyway, to the extent that instruction-level parallelism can be considered "free".

If you're really taking a microscope to performance, the main hazards would be intermittently using AVX for only a few instructions, because that might lead to the CPU stopping for a few microseconds to turn the power on and off on the functional units. If you're using them heavily the overall thermal situation might cause a core or package-wide clock rate degradation, but if you have a use case for sustained AVX-512 usage this is likely to be a good tradeoff.

Pushing them out of the CPU, I don't know, but some SIMD instruction sets on some CPUs have side effects that can negatively affect the performance of other operations. For example, the use of AVX2 / AVX-512 can cause some Intel CPUs to lower their base frequency, thus reducing the performance of simultaneous operations that are not using SIMD.
Is that still true? I was under the impression that post-Haswell CPUs don’t have that particular issue.
Not for recent hardware - Ice Lake eliminated the throttling on 256b ops (they already didn't exist for certain 256b ops and all smaller ops), and reduced the throttling to almost nothing for 512b ops. Rocket Lake eliminated the throttling for 512b ops.

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...

They do use a lot of power (and as a result, generate a lot of heat), so they can still cause thermal throttling, or throttling due to power limits - but there's no more "AVX offset"

They do emit a lot of heat, which might actually throttle the CPU overall.

But to my knowledge they use different registers, and when properly pipelined they don't hog the CPU cache like unoptimized algorithms that constantly round trip to RAM.

The CPU has a certain upper bound on power, TDP. That limit is independent of whether it is reached by executing scalar code on lots of cores, or SIMD on a few. One little-known fact is that out of order CPUs burn lots of energy just on scheduling instructions, ~10x more than the operation itself. SIMD amortizes that per-instruction overhead over multiple data elements, and actually increases the amount of useful work done per joule. We're talking 5-10x increase in energy efficiency.
Right, though your use of the terms "different registers" and "properly pipelined" is quite nuanced for a non-specialist. It leaves a lot to the imagination or prior experience!

If it is a compute-heavy code that was spilling to a stack and now can fit in the SIMD registers, there could be a speed up with less memory traffic. This is analogous to a program's working set either fitting in RAM or spilling to swap while it runs, but at a different level of the memory hierarchy. In the extreme case, your working set could be a fixed set of variables in expensive loops for some sequential algorithm, and so the compiler register placement ability and number of registers available could be the difference between effectively O(1) or O(n) memory accesses with respect to the number of compute steps.

Of course, you can find such transitions between registers/L1 cache, L1/L2 cache, L2/L3 cache (if present), local/remote RAM (for modern NUMA machines), etc. This happens as you transition from a pure CPU-bound case to something with bigger data structures, where the program focus has to move to different parts of the data to make progress. Naively holding other things equal, an acceleration of a CPU-bound code will of course mean you churn through these different data faster, which means more cache spills as you pull in new data and have to displace something else. Going back to the earlier poster's question, this cache spilling can have a detrimental effect on other processes sharing the same cache or NUMA node, just like one extremely bloated program can cause a virtual memory swapping frenzy and hurt every other program on the machine.

One form of "pipelining" optimization is to recognize when your loop is repeatedly fetching a lot of the same input data in multiple iterations and changing that to use variables/buffers to retain state between iterations and avoid redundant access. This often happens in convolution algorithms on arrays of multi-dimensional data, e.g. image processing and neural nets and many cellular-automata style or grid-based simulations. Your algorithm slides a "sampling window" along an array to compute an output value as a linear combination of all the neighboring values within the window using some filtering kernel (a fixed pattern of multiplicative scalars/weights). Naive code keeps loading this whole window once for each iteration of the loop to address one output position. It is effectively stateless except for the loops to visit every output. Pipelined code would manage a window buffer and fetch and replace only the fringe of the buffer while holding and reusing inputs from the previous iterations. While this makes the loops more stateful, it can dramatically reduce the number of memory reads and shift a memory-bound code back into a CPU-bound regime.

Other major optimizations for these N-dimensional convolutions are done by the programmer/designer: some convolution kernels are "decomposable" and the expensive multi-dimensional operation can be strength-reduced to a sequence of 1-dimensional convolutions with appropriately chosen filter kernels to produce the same mathematical result. This optimization has nothing to do with SIMD but simplifies the work to embody the operation. Fewer input values to read (e.g. a 1D line of neighbors instead of 2D box) and the number of arithmetic operations to do all the multiply-and-add steps to produce the output. Imagine a 2D filter that operates on an 8x8 window. The 2D convolution has to sample and combine 64 input neighbors per output, while the decomposed filter does 8 neighbors in each axis and so one quarter as many steps total after one pass for the X axis and one pass for the Y axis.

Naively, decompsition is done as separate 1D passes over the data and so performs more memory writes and allocates an additional intermediate array between the original input and final output. This is often a big win in spite of the extra memory demands. It's a lot more coding, but this could also be "pipelined" to some degree in N-dimensions to avoid pushing the entire intermediate results arrays through main memory. Approaches differ, but you can make different tradeoffs for how many intermediate results to store or buffer versus redundant calculation in a less stateful manner.

Generally, as you make the above kinds of optimizations your code becomes more "block-based", loading larger chunks of data with fewer changes to new "random" memory locations. This is very much like how databases and filesystems optimized their access to disk storage in prior decades to avoid random seeks for individual bytes or blocks and to instead use clever structures to support faster loading of sets of blocks or pages. When this is successful, your program does most of its memory access sequentially and achieves higher throughput that is bound by total memory bus bandwidth rather than random access latency limitations.

The final form of "pipelining" matters once you really hit your stride with these optimized, block-based programs. All this access again can cause cache spilling as you sustain high bandwidth memory access. But if you can provide the right hints to the CPU, or if the CPU is clever enough to detect the pattern, it can shift into a different cache management regime. Knowing that you will only access each block of data once and then move on (because you are holding the reusable state in registers), the CPU can use a different prefetching and spilling policy to both optimize for your next memory access and to avoid churning the whole cache with these values you will only read once and then ignore. This reduces the impact of the code on other concurrent tasks which can still enjoy more reasonable caching behavior in spite of the busy neighbor.

Generally yes to the first question, no to the second.

If you want your code to have low perturbance on other concurrent work done by the system, implementing it in a inefficient way doesn't usually help with that goal, since then your code will just be executing all the time because it takes a long time to finish. And you still won't have good control of the execution resources compared to using normal tools like OS scheduling policies to address the goal.