Hacker News new | ask | show | jobs
by brigade 1477 days ago
Soo... basically a 2x speedup in going from 4x128b to 2x512b ALUs, after discounting the frequency difference. But realistically, Intel's client configurations are 3x256b, which is only 25-40% faster in that paper.

(I suspect any application doing enough quicksort that the 2x speedup is significant, would be even happier going slightly off-core to a coprocessor more specialized in vector processing, like Hwacha. There's plenty of space between "tightly-coupled CPU SIMD" and "GPU" that I think makes more sense than needing to implement 512-bit registers in little cores.)

2 comments

> Soo... basically a 2x speedup in going from 4x128b to 2x512b ALUs, after discounting the frequency difference. But realistically, Intel's client configurations are 3x256b, which is only 25-40% faster in that paper.

2.4x difference was, in fact, reported, however I still find it somewhat difficult to interpret the reported results. The processing unit size difference alone and the number of LU's can't account for such a big difference in transfer speeds as the M1 Max that was used in the assessment has a very wide memory bus (256 bit wide for a performance core cluster or 512 bit wide for the entire SoC) as well as unusually large L1-D cache and a large L2 cache, with both caches having deep TLB's. The test set they used could also fully fit into the L2 cache. I have asked the Google engineer a question in a separate thread about what else could influence the observed performance difference but have not received a satisfactory explanation.

I am that engineer; happy to clarify/go into more detail. First, just to ensure we're on the same page: the speeds we quote are sort goodput, i.e. amount of sorted data produced per time. Memory bus is only incidental to keeping the vector units fed.

The key bottleneck is partitioning. AVX-512 does really well there because it has dedicated compressstore instructions, and it's actually even faster to partition a vector via vperm* (because we only need to do that once, whereas two compressstore are required to partition). So AVX-512 reaches >25 GB/s partition throughput per core; it's instead limited by the memory bandwidth each core can access (around 11 GB/s if a single core is active, less when all are competing for the total "128 GB/s").

By contrast, NEON for example in the M1 has 128-bit vectors. Its "4 vector units" (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely - Intel can also only execute some instructions on certain ports) are definitely not as good as actual 512 bit vectors, because partitioning only has a left and right side, and we don't have enough ILP for each of those to keep 2 vector units busy. Hence NEON reaches 11 GB/s partition throughput. It would seem like this matches Skylake, but no: once a subarray fits into cache, Skylake is freed from the memory bottleneck and is at least twice as fast there (which is a sizable fraction of the total sort time).

Does this help explain the results?

> The test set they used could also fully fit into the L2 cache.

This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?

> (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely - Intel can also only execute some instructions on certain ports)

It's pretty symmetrical, moreso than Cortex-X2's 4 pipelines; there's analysis that on M1 only some floating point and crypto instructions can't execute on all 4 pipelines. [1] (TP in that table is inverse throughput)

Which means that, for example, byte permutes from tables 256b or less can actually achieve the same throughput on M1 as with Intel's AVX-512, since M1 can sustain 4x 2-register TBL per cycle. And doing the exact equivalent of a 512b vpermb (3 cycle latency, 1/cycle throughput) can be done with 5 cycles latency and 0.33/cycle throughput on M1, via 4x 4-register TBL.

Well, a vpermd in NEON would need an extra MLA to convert indexes, and vpermi2* equivalents fall off a cliff. And Intel still has p01 free, and COMPACT is SVE. But in general, a lot of the parallelism that enables AVX-512 implementations will convert directly into ILP across 128b vectors.

> This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?

Anandtech [2] measured the same L2 latency up to about 8MB single-core, so regardless of the details, 8MB is a pretty significant cliff on M1. Regardless, RAM bandwidth is ~60GB/s, and unlike Intel, can be just about saturated by a single core.

[1] https://dougallj.github.io/applecpu/firestorm-simd.html

[2] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

Thanks for the instruction table, hadn't seen that yet. That is indeed remarkably symmetrical! It does reinforce the "half of AVX-512" result - we sustain ipc=2 on Skylake (with 512-bit) and it looks like M1 would sustain 4 (x128 bit).

huh, that's surprising, that plot indeed looks like a core might be grabbing more than 'its share' of L2, though not all. The 'full random' curve starts creeping up after ~3MB as expected, so the situation seems to be even more complex than "use up to 8MB".

For completeness I'll also measure for 100M elements single core, though on M1 that wouldn't make a difference because as you say, a single core can drive a lot of memory bandwidth, enough that NEON becomes the bottleneck.

If something is bound by memory/cache bandwidth, then increasing ALU width wouldn't help in the first place.
This is an interesting question. Has there been much uptake of such accelerators/coprocessors? One concern is that by the time the HW is ready, SW wants to do something different, perhaps fusing some other step with the sort. Another is deployability: everyone has SIMD/vectors on board, but even GPUs aren't quite everywhere nor so easy to scale out.

Also, there are now several RISC-V CPUs with 512-bit vectors, and it seems fair to call them little cores especially compared to x86 and M1/M2. Perhaps 512-bit is more feasible (and sensible) than is widely believed?