by brigade 1474 days ago
> (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely - Intel can also only execute some instructions on certain ports)

It's pretty symmetrical, more so than Cortex-X2's 4 pipelines; there's analysis showing that on M1 only some floating-point and crypto instructions can't execute on all 4 pipelines. [1] (TP in that table is inverse throughput.)

This means that, for example, byte permutes from tables of 256b or less can actually achieve the same throughput on M1 as with Intel's AVX-512, since M1 can sustain 4x 2-register TBL per cycle. And the exact equivalent of a 512b vpermb (3 cycle latency, 1/cycle throughput) can be done with 5 cycles latency and 0.33/cycle throughput on M1, via 4x 4-register TBL.
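To make the 4x 4-register TBL equivalence concrete, here is a minimal scalar sketch in Python. It only models the lookup semantics, not latency or throughput, and the function names (`tbl`, `vpermb_512`, `vpermb_via_tbl`) are made up for illustration. One caveat it makes visible: TBL returns 0 for an out-of-range index while vpermb uses only the low 6 bits of each index, so the two agree when indices are in 0..63.

```python
def tbl(table_regs, idx):
    """Scalar model of NEON TBL.

    table_regs: list of 1..4 table registers, 16 bytes each.
    idx: 16 byte indices. An index past the end of the table yields 0.
    """
    table = b"".join(table_regs)
    return bytes(table[i] if i < len(table) else 0 for i in idx)


def vpermb_512(table64, idx64):
    """Scalar model of a 512b vpermb: only the low 6 bits of each index are used."""
    return bytes(table64[i & 63] for i in idx64)


def vpermb_via_tbl(table64, idx64):
    """Same permute built from four 4-register TBLs (hypothetical helper).

    All four lookups share the same 64-byte table; each one handles
    16 of the 64 indices. Assumes every index is in 0..63.
    """
    regs = [table64[i:i + 16] for i in range(0, 64, 16)]
    out = b""
    for j in range(0, 64, 16):
        out += tbl(regs, idx64[j:j + 16])
    return out


table = bytes(range(100, 164))                # 64-byte table
idx = bytes((i * 7) % 64 for i in range(64))  # in-range indices only
same = vpermb_via_tbl(table, idx) == vpermb_512(table, idx)
```

The four lookups are independent of each other, which is why they can overlap across the SIMD pipelines instead of serializing.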

Well, a vpermd in NEON would need an extra MLA to convert indexes, and vpermi2* equivalents fall off a cliff. And Intel still has p01 free, and COMPACT is SVE. But in general, a lot of the parallelism that enables AVX-512 implementations will convert directly into ILP across 128b vectors.

> This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?

Anandtech [2] measured the same L2 latency up to about 8MB single-core, so regardless of the details, 8MB is a pretty significant cliff on M1. In any case, RAM bandwidth is ~60GB/s and, unlike on Intel, it can be just about saturated by a single core.

[1] https://dougallj.github.io/applecpu/firestorm-simd.html

[2] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

1 comment

Thanks for the instruction table, hadn't seen that yet. That is indeed remarkably symmetrical! It does reinforce the "half of AVX-512" result - we sustain ipc=2 on Skylake (with 512-bit) and it looks like M1 would sustain 4 (x128 bit).
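The arithmetic behind "half of AVX-512" can be spelled out with a rough sketch. It assumes peak vector work per cycle is simply instructions per cycle times vector width, and the M1 figure of 4 is the expectation from the table, not something measured here. The helper name is made up for illustration.

```python
def vector_bits_per_cycle(ipc, vector_bits):
    """Peak vector bits processed per cycle under the simple ipc * width model."""
    return ipc * vector_bits


skylake = vector_bits_per_cycle(ipc=2, vector_bits=512)  # 2 x 512-bit
m1 = vector_bits_per_cycle(ipc=4, vector_bits=128)       # 4 x 128-bit (expected)
ratio = m1 / skylake
```

Under that model Skylake comes out at 1024 bits per cycle and M1 at 512, a ratio of 0.5.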

Huh, that's surprising: that plot indeed looks like a core might be grabbing more than 'its share' of L2, though not all of it. The 'full random' curve starts creeping up after ~3MB as expected, so the situation seems to be more complex than "use up to 8MB".

For completeness I'll also measure for 100M elements single core, though on M1 that wouldn't make a difference because as you say, a single core can drive a lot of memory bandwidth, enough that NEON becomes the bottleneck.