|
|
|
|
|
by brigade
1477 days ago
|
|
Soo... basically a 2x speedup in going from 4x128b to 2x512b ALUs, after discounting the frequency difference. But realistically, Intel's client configurations are 3x256b, which is only 25-40% faster in that paper. (I suspect any application doing enough quicksort that the 2x speedup is significant, would be even happier going slightly off-core to a coprocessor more specialized in vector processing, like Hwacha. There's plenty of space between "tightly-coupled CPU SIMD" and "GPU" that I think makes more sense than needing to implement 512-bit registers in little cores.) |
|
2.4x difference was, in fact, reported, however I still find it somewhat difficult to interpret the reported results. The processing unit size difference alone and the number of LU's can't account for such a big difference in transfer speeds as the M1 Max that was used in the assessment has a very wide memory bus (256 bit wide for a performance core cluster or 512 bit wide for the entire SoC) as well as unusually large L1-D cache and a large L2 cache, with both caches having deep TLB's. The test set they used could also fully fit into the L2 cache. I have asked the Google engineer a question in a separate thread about what else could influence the observed performance difference but have not received a satisfactory explanation.