Hacker News new | ask | show | jobs
by Const-me 590 days ago
One interesting thing about these newfangled matrix/AI/ML accelerators that’s very rarely mentioned on the internets, they only deliver that many TFLOP because they operate in very low precision.

nVidia tensor cores support int8, couple versions of FP16 (BF16 and the standard IEEE one) and FP19 which they call TensorFloat-32. I think Intel AMX only supports int8 and BF16.

None of them supports FP32 let alone FP64 input numbers, which makes them completely useless for traditional GEMM stuff.

2 comments

https://github.com/corsix/amx indicates that Apple's AMX supports up to 64-bit FP, but I don't see any performance metrics. They also have the ANE, which is the low-precision ML-focused accelerator.
I wasn’t aware there’re two completely different things from different companies both called AMX. I assumed that AMX: https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions

The Apple’s version is indeed interesting. I wonder why haven’t Apple exposed it to programmers, or implemented a BLAS library on top of that thing?

> I wonder why haven’t Apple exposed it to programmers, or implemented a BLAS library on top of that thing?

Using the Accelerate framework (which includes Apple's BLAS) is the only supported way for programmers to access the AMX. Reverse engineering the instruction set to access it directly is discouraged, because it's not a documented stable interface.

It’s because Apple’s AMX was an early, in-house built version of what was eventually released by Arm as SME. Apple adopted SME in the M4 and dropped AMX, but as long as you were using their Accelerate framework instead of directly writing AMX code (which they told people not to do), you wouldn’t notice.

Now that they’re using “standard” SME, it shouldn’t be a problem to write SME assembly opcodes directly, although I suspect Apple themselves is still probably sparse on the documentation. I’m not aware if there’s any way to use intrinsics or something slightly higher level than inline-ASM, but lower level than the Accelerate framework.

Apple's matrix unit supports FP16, 32, and 64 sources and accumulators.