| HN Mirror



	by llm_nerd 1107 days ago
	The neural engine on all recent Apple silicon (and A## devices) has "tensor" cores for matrix calculations (note: Apple abstracts all of this behind CoreML, so there is some conflation between the ANE and the AMX instructions/hardware). The M2 Ultra offers 31.6 trillion ops per second at fp16, for instance, which actually bests an A100. The software support is terrible, of course, and that is the biggest limitation, but Apple clearly wants to be in that realm as well.


bufo 1107 days ago

The neural engine has severe limitations at the moment. I tried using it for BERT about a year ago and kept crashing its API with "out of memory" errors. The theoretical TOPS you mention also don't necessarily translate into usable TOPS, because memory bandwidth and caches limit how fast the compute units can be fed. This is why, for example, the comparison of the M1 Max with an RTX 3090 was completely off.
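A rough way to see the gap between theoretical and usable TOPS is a roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (ops performed per byte moved). The sketch below is only illustrative. The 31.6 TOPS figure is the one quoted above; the 800 GB/s figure is Apple's published unified memory bandwidth for the M2 Ultra, and assuming the neural engine can actually use all of it is an assumption made here. The function name `attainable_tops` and the sample intensities are hypothetical.

```python
def attainable_tops(peak_tops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Simple roofline model: min(peak compute, bandwidth * arithmetic intensity)."""
    # GB/s * ops/byte = Gops/s; divide by 1000 to get Tops/s.
    bandwidth_bound_tops = bandwidth_gbs * ops_per_byte / 1000.0
    return min(peak_tops, bandwidth_bound_tops)


PEAK_TOPS = 31.6       # theoretical figure quoted in the thread
BANDWIDTH_GBS = 800.0  # assumption: full M2 Ultra memory bandwidth is available

# Intensity at which the workload stops being bandwidth-bound.
ridge_ops_per_byte = PEAK_TOPS * 1000.0 / BANDWIDTH_GBS

if __name__ == "__main__":
    print(f"ridge point: {ridge_ops_per_byte:.1f} ops/byte")
    for intensity in (1, 10, 100):  # hypothetical workloads
        tops = attainable_tops(PEAK_TOPS, BANDWIDTH_GBS, intensity)
        print(f"{intensity:>4} ops/byte -> {tops:.1f} usable TOPS")
```

Under these assumptions, a workload doing only a few ops per byte moved reaches a small fraction of the headline number, and only workloads above the ridge point can approach the peak.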


llm_nerd 1107 days ago

I certainly can't speak to your specific uses or issues, but we've really moved the goalposts from the prior claim that it didn't have tensor (i.e. matrix) functionality.

My daily work life includes running a lot of models on Apple hardware (Apple Silicon and A1# chips with the neural engine) using CoreML, often PyTorch models converted with coremltools. The performance of the Apple chips is spectacular if the intrinsics are supported (things obviously get dicier when a model uses ops that are currently unsupported). I mean, the memory bandwidth of the M2 Ultra is within spitting distance of the 4090's GDDR6X.

People aren't going to be replacing H100 arrays with Apple Silicon, and even as a fan I use NVIDIA hardware for training and convert the models to CoreML after the fact, but Apple clearly isn't satisfied with being just some toy. They are continually climbing up that vine.


bufo 1107 days ago

Yes, you are correct that the ANE does have the equivalent of tensor cores, and I didn’t mention that. I just don’t expect it to be usable beyond inference, because the number of compute units will not work for batches in medium/large/huge networks. That’s obviously by design! The ANE silicon area is tiny compared to the GPU area. I actually wouldn’t be surprised if Apple strategically invests only in using their GPU for LLM (1B+ params) work.

Note that if you are currently using CoreML for LLMs, all the work is done on the GPU.
