Do these implementations use the neural engine? I saw that there was a stable diffusion implementation using the neural engine and I found that my macbook noticably did not run hot, as opposed to an average Teams call.
It doesn't. You need to generate models for use on the neural engine, which apple did for Stable Diffusion, but this is just taking advantage of lots of fast RAM and lots and lots of threads, if I understand it correctly.
It uses Metal acceleration, and takes advantage of the shared memory architecture, meaning it's basically a GPU with 196GB VRAM. Trading space (VRAM) for time (FLOPs), it can beat the performance of an RTX4080 here.
Encoder only transformers (like BERT) can be made to run on neural engine with CoreML. Efficient inference with autoregressive encoder-decoder and decoder only transformers (aka LLMs) needs KV-caching, which currently can't be efficiently implemented with CoreML (and thus neural engine). So, for now it's GPU only, with Metal.
You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade off and use fixed size inputs [1] but the speed up over no caching is meaningful.
There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.
Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both.
Autoregressive transformer models are usually memory bound, whereas SD is compute bound, so perhaps the difference lies here. Also the reason why SD runs so much faster on the GPU than on the CPU.
M1 has (fast) unified memory between GPU and CPU, so something being memory bound ought not to have much bearing on whether it belongs on CPU or GPU… at least in theory. I’m a total noob here though so I may be wrong.