by stephencanon
590 days ago
Probably more relevant here is that a single CPU core on that computer exceeds 1 TFLOP/s on gemm with plenty of margin using a single lib call, and leaves the rest of the CPU cores and all of the GPU free to do other work.
That single lib call must have used the AMX accelerator, which is separate from the cores and shared by a group of cores.
So the AMX accelerator's performance may be greater than that of all the CPU cores together. AFAIK, some Apple CPUs have one AMX accelerator for the big cores and another for the smaller cores. Either way, if you obtained 1 TFLOP/s running the program on one core, you should not expect much more from running it on multiple cores, because all cores of the same type use the same shared accelerator.
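To make the scaling argument concrete, here is a rough sketch in Python. It is only a toy model: the function names and every number in it are made up for illustration, not measurements of any Apple chip. The first function converts a gemm timing into TFLOP/s using the usual 2*m*n*k flop count; the second assumes that all threads in one cluster share a single accelerator, so their combined throughput is capped at that accelerator's peak.

```python
def gemm_tflops(m, n, k, seconds):
    """TFLOP/s for one (m x k) @ (k x n) multiply, counting 2*m*n*k flops."""
    return 2 * m * n * k / seconds / 1e12


def cluster_tflops(threads, per_thread_tflops, accelerator_peak_tflops):
    """Toy model (an assumption, not a hardware spec): threads in one
    cluster share one accelerator, so total throughput is capped at
    the accelerator's peak."""
    return min(threads * per_thread_tflops, accelerator_peak_tflops)


if __name__ == "__main__":
    # Hypothetical timing: a 4096^3 gemm finishing in 0.1 s.
    print(f"{gemm_tflops(4096, 4096, 4096, 0.1):.2f} TFLOP/s")

    # Hypothetical figures: one thread reaches 1.0 TFLOP/s and the
    # shared accelerator tops out at 1.2 TFLOP/s.
    for threads in (1, 2, 4):
        print(threads, cluster_tflops(threads, 1.0, 1.2))
```

Under those assumed figures, going from one thread to four gains almost nothing, because a single thread already comes close to saturating the shared unit.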