by inkyoto
1477 days ago
> Soo... basically a 2x speedup in going from 4x128b to 2x512b ALUs, after discounting the frequency difference. But realistically, Intel's client configurations are 3x256b, which is only 25-40% faster in that paper.

A 2.4x difference was, in fact, reported; however, I still find the reported results somewhat difficult to interpret. The processing unit size difference alone and the number of LUs can't account for such a big difference in transfer speeds, as the M1 Max that was used in the assessment has a very wide memory bus (256 bits wide for a performance core cluster, or 512 bits wide for the entire SoC) as well as an unusually large L1-D cache and a large L2 cache, with both caches having deep TLBs. The test set they used could also fully fit into the L2 cache. I have asked the Google engineer a question in a separate thread about what else could influence the observed performance difference but have not received a satisfactory explanation.
The key bottleneck is partitioning. AVX-512 does really well there because it has dedicated compressstore instructions, and it's actually even faster to partition a vector via vperm* (because we only need to do that once, whereas two compressstore are required to partition). So AVX-512 reaches >25 GB/s partition throughput per core; it's instead limited by the memory bandwidth each core can access (around 11 GB/s if a single core is active, less when all are competing for the total "128 GB/s").
By contrast, NEON (for example, in the M1) has 128-bit vectors. Its "4 vector units" (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely, since Intel can also only execute some instructions on certain ports) are definitely not as good as actual 512-bit vectors, because partitioning only has a left and a right side, and we don't have enough ILP for each of those to keep 2 vector units busy. Hence NEON reaches 11 GB/s partition throughput. It would seem like this matches Skylake, but no: once a subarray fits into cache, Skylake is freed from the memory bottleneck and is at least twice as fast there (which is a sizable fraction of the total sort time).
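To make the "two compressstore per vector" point concrete, here is a rough scalar model of compress-store partitioning. It is only a sketch and not the actual sort code: `kLanes`, `CompressStore` and `Partition` are names made up for illustration, the 16 x 32-bit lane count is an assumption that models one 512-bit vector, and it writes out-of-place to keep things simple (an in-place version is more involved).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 16 lanes of 32 bits model one 512-bit vector.
constexpr std::size_t kLanes = 16;
constexpr uint32_t kAllLanes = (1u << kLanes) - 1;

// Scalar stand-in for a compress-store: writes only the lanes whose mask bit
// is set, packed contiguously at `out`, and returns how many were written.
std::size_t CompressStore(const std::array<int32_t, kLanes>& v, uint32_t mask,
                          int32_t* out) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < kLanes; ++i) {
    if (mask & (1u << i)) out[n++] = v[i];
  }
  return n;
}

// Out-of-place partition: keys < pivot go to the front of `out`, all other
// keys to the back. Returns the number of keys < pivot.
std::size_t Partition(const std::vector<int32_t>& in, int32_t pivot,
                      std::vector<int32_t>& out) {
  out.assign(in.size(), 0);
  std::size_t left = 0;           // next free slot on the left side
  std::size_t right = in.size();  // one past the last free slot on the right
  std::size_t i = 0;
  for (; i + kLanes <= in.size(); i += kLanes) {
    // "Load" one vector and compare every lane against the pivot.
    std::array<int32_t, kLanes> v;
    uint32_t mask = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      v[lane] = in[i + lane];
      if (v[lane] < pivot) mask |= 1u << lane;
    }
    // First compress-store: lanes that belong on the left.
    const std::size_t num_left = CompressStore(v, mask, out.data() + left);
    left += num_left;
    // Second compress-store: the remaining lanes, packed at the right end.
    right -= kLanes - num_left;
    CompressStore(v, ~mask & kAllLanes, out.data() + right);
  }
  // Scalar handling of the leftover keys that do not fill a whole vector.
  for (; i < in.size(); ++i) {
    if (in[i] < pivot) {
      out[left++] = in[i];
    } else {
      out[--right] = in[i];
    }
  }
  return left;  // equals `right` once every key has been placed
}
```

Each input vector costs one compare plus two compress-stores here. The vperm* variant mentioned above would instead rearrange the vector once, so that left and right keys are already grouped before storing.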
Does this help explain the results?
> The test set they used could also fully fit into the L2 cache.
This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?