by adrian_b
591 days ago
Nope, no Apple CPU core has that kind of performance. That single library call must have used the AMX accelerator, which is separate from the cores and shared by a group of cores, so the AMX accelerator's throughput may exceed that of all the CPU cores combined. AFAIK, some Apple CPUs have one AMX accelerator for the big cores and another for the small cores. Either way, if you got 1 TFLOP/s running the program on 1 core, you should not expect much more from running it on multiple cores, because all cores of the same type use the same shared accelerator.
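To make the scaling argument concrete, here is a toy model, not a measurement. It assumes every core of one type feeds a single shared accelerator, so aggregate throughput is capped by that accelerator's peak. The function name and the 1.2 TFLOP/s peak are hypothetical; only the 1 TFLOP/s single-core figure comes from the discussion above.

```python
def shared_accel_tflops(n_cores, single_core_tflops=1.0, accel_peak_tflops=1.2):
    """Toy model: n_cores of the same type all drive ONE shared accelerator.

    single_core_tflops: rate one core achieves alone (1 TFLOP/s, from the thread).
    accel_peak_tflops:  hypothetical peak of the shared accelerator (assumption).
    """
    # Adding cores only helps until the shared accelerator is saturated.
    return min(n_cores * single_core_tflops, accel_peak_tflops)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} core(s): {shared_accel_tflops(n):.1f} TFLOP/s")
```

Under these assumed numbers the curve goes flat as soon as a second core is added, which is the point: the cap is the accelerator, not the core count.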
NVIDIA tensor cores support int8, a couple of versions of FP16 (BF16 and the standard IEEE one), and FP19, which they call TensorFloat-32. I think Intel AMX only supports int8 and BF16.
None of them supports FP32, let alone FP64, input numbers, which makes them completely useless for traditional GEMM stuff.
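For a feel of what "FP19" means in practice, here is a rough sketch that emulates the reduced input formats by masking bits of an FP32 value. TensorFloat-32 keeps 1 sign bit, 8 exponent bits and 10 mantissa bits (19 bits in total), and BF16 keeps 1 + 8 + 7. The helper names are made up, and plain truncation is an assumption: real hardware may round differently.

```python
import struct


def _f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits: int) -> float:
    """Inverse of _f32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32 by truncation: keep sign, 8 exponent and 10 mantissa bits."""
    return _bits_f32(_f32_bits(x) & 0xFFFFE000)  # clear the low 13 mantissa bits


def to_bf16(x: float) -> float:
    """Emulate BF16 by truncation: keep sign, 8 exponent and 7 mantissa bits."""
    return _bits_f32(_f32_bits(x) & 0xFFFF0000)  # clear the low 16 mantissa bits


if __name__ == "__main__":
    x = 1.0 + 2.0 ** -11  # exactly representable in FP32
    print("FP32:", x)
    print("TF32:", to_tf32(x))
    print("BF16:", to_bf16(x))
```

Anything smaller than 2^-10 relative to the leading bit is lost on input in TF32 (2^-7 for BF16), which is why these formats are a poor fit for GEMM code that expects full FP32 or FP64 inputs.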