| HN Mirror

	by Metus 1024 days ago
	Do these implementations use the Neural Engine? I saw that there was a Stable Diffusion implementation using the Neural Engine, and I found that my MacBook noticeably did not run hot, unlike during an average Teams call.

3 comments

snitty 1024 days ago

It doesn't. You need to generate models for use on the Neural Engine, which Apple did for Stable Diffusion, but this is just taking advantage of lots of fast RAM and lots and lots of threads, if I understand it correctly.

ramesh31 1024 days ago

It uses Metal acceleration, and takes advantage of the shared memory architecture, meaning it's basically a GPU with 196GB VRAM. Trading space (VRAM) for time (FLOPs), it can beat the performance of an RTX4080 here.
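
As a rough illustration of why the amount of GPU-addressable memory matters here, the sketch below estimates how much memory the weights alone occupy at different precisions. The parameter counts and the helper name `weight_gib` are illustrative assumptions, not figures from the PR, and the estimate ignores activations and the KV cache.

```python
def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    return n_params * bits_per_param / 8 / 2**30

# Hypothetical model sizes, chosen only for illustration.
for n_params in (7e9, 65e9):
    for bits in (16, 4):
        print(f"{n_params / 1e9:.0f}B params @ {bits}-bit: "
              f"{weight_gib(n_params, bits):.1f} GiB")
```

Comparing these numbers against the memory a given GPU can address shows which models fit without offloading.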

lostmsu 1023 days ago

> can beat the performance of an RTX4080 here

This needs some backing. When the M1 first came out, people were claiming it was comparable to a 3080, until they saw the performance difference.

ramesh31 1023 days ago

Read the PR

woadwarrior01 1024 days ago

Encoder-only transformers (like BERT) can be made to run on the Neural Engine with CoreML. Efficient inference with autoregressive encoder-decoder and decoder-only transformers (aka LLMs) needs KV caching, which currently can't be efficiently implemented with CoreML (and thus the Neural Engine). So, for now it's GPU only, with Metal.

smpanaro 1024 days ago

You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade-off and use fixed-size inputs [1], but the speedup over no caching is meaningful.

There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.

Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both.

[1] Technically you should be able to use 1 input with an enumerated set of sizes but I haven't been able to get it to work on the Neural Engine. This would likely be even faster.

[2] https://github.com/wangchou/whisper.coreml/

[3] https://github.com/smpanaro/more-ane-transformers/
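
To make the fixed-size trade-off concrete, here is a minimal sketch of a KV cache whose shape never changes: each step writes one slot in place and masks the unused slots. It is a toy single-head attention in plain Python, not CoreML and not the code from the linked repos; `D`, `MAX_LEN`, the random weights, and every function name are hypothetical.

```python
import math
import random

D, MAX_LEN = 4, 8  # hypothetical head size and fixed cache length
random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

Wq, Wk, Wv = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def project(x, W):
    """Row vector times matrix."""
    return [sum(x[i] * W[i][j] for i in range(D)) for j in range(D)]

def attend(q, keys, values, n_valid):
    """Softmax attention over the first n_valid slots; the rest are masked."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores[:n_valid])
    weights = [math.exp(s - top) if i < n_valid else 0.0
               for i, s in enumerate(scores)]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(D)]

def step_cached(x, k_cache, v_cache, pos):
    """One decode step: write slot `pos` in place; cache shape never changes."""
    k_cache[pos] = project(x, Wk)
    v_cache[pos] = project(x, Wv)
    return attend(project(x, Wq), k_cache, v_cache, pos + 1)

def step_uncached(prefix):
    """Baseline: recompute keys and values for the whole prefix every step."""
    keys = [project(x, Wk) for x in prefix]
    values = [project(x, Wv) for x in prefix]
    return attend(project(prefix[-1], Wq), keys, values, len(prefix))

tokens = rand_matrix(5, D)  # stand-in for token embeddings
k_cache = [[0.0] * D for _ in range(MAX_LEN)]
v_cache = [[0.0] * D for _ in range(MAX_LEN)]
outputs_match = True
for pos, x in enumerate(tokens):
    cached = step_cached(x, k_cache, v_cache, pos)
    full = step_uncached(tokens[: pos + 1])
    outputs_match &= all(math.isclose(a, b, abs_tol=1e-9)
                         for a, b in zip(cached, full))
print("cached == uncached:", outputs_match)
```

The cached step does two projections per token instead of reprojecting the whole prefix, at the cost of always attending over `MAX_LEN` slots.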

cypress66 1023 days ago

>I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching).

That seems very slow compared to llama.cpp?

smpanaro 1023 days ago

Yeah, I believe it is. You trade speed for lower power and CPU usage. 8 tokens/sec is usable though.
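
For reference, the conversion from the quoted per-token latencies to throughput is just a reciprocal. The helper name `tokens_per_second` is made up for this sketch; the 120 ms and 450 ms inputs are the figures quoted above.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token

cached = tokens_per_second(120)    # quoted figure, with KV caching
uncached = tokens_per_second(450)  # quoted figure, without caching
print(f"cached: {cached:.1f} tok/s, uncached: {uncached:.1f} tok/s, "
      f"speedup: {cached / uncached:.2f}x")
# cached: 8.3 tok/s, uncached: 2.2 tok/s, speedup: 3.75x
```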

GaggiX 1024 days ago

Autoregressive transformer models are usually memory bound, whereas SD is compute bound, so perhaps the difference lies here. That's also why SD runs so much faster on the GPU than on the CPU.
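
A back-of-the-envelope way to see the memory-bound claim, assuming batch-size-1 decoding reads every weight once per token and spends roughly 2 FLOPs per parameter: the FLOPs-per-byte ratio is tiny, so throughput is capped by bandwidth divided by model size. The function names and the example inputs below are hypothetical, not measurements.

```python
def decode_ceiling_tok_s(n_params: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: every weight is read once per generated token."""
    bytes_per_token = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

def arithmetic_intensity(bytes_per_param: float, batch_size: int = 1) -> float:
    """FLOPs per byte of weights read, assuming ~2 FLOPs per parameter per token."""
    return 2 * batch_size / bytes_per_param

# Illustrative inputs only: a 1.5B-parameter model in 16-bit weights on a
# device with an assumed 100 GB/s of memory bandwidth.
print(f"ceiling: {decode_ceiling_tok_s(1.5e9, 2, 100):.1f} tok/s")
print(f"intensity at batch 1: {arithmetic_intensity(2):.1f} FLOPs/byte")
```

Raising the batch size raises the intensity, which is one reason batched workloads like image diffusion tend to end up compute bound instead.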

ninkendo 1024 days ago

M1 has (fast) unified memory between GPU and CPU, so something being memory bound ought not to have much bearing on whether it belongs on CPU or GPU… at least in theory. I’m a total noob here though so I may be wrong.

GaggiX 1024 days ago

We were mostly discussing the NPU; I don't know if it makes a difference.

lib-dev 1024 days ago

From https://en.wikipedia.org/wiki/Apple_M1#Memory

> The M1 uses a 128-bit LPDDR4X SDRAM in a unified memory configuration shared by all the components of the processor.

I assume that includes the NPU, media engine, etc.
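
The bus width in that quote is enough to estimate peak bandwidth once you pick a transfer rate. The sketch below assumes 4266 MT/s for the LPDDR4X; that rate is not in the quoted sentence, and `peak_bandwidth_gb_s` is a made-up helper name.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak = bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9

# 128-bit bus from the quote; 4266 MT/s is an assumed transfer rate.
m1_peak = peak_bandwidth_gb_s(128, 4266)
print(f"{m1_peak:.1f} GB/s")
```

This is a theoretical ceiling shared by every component on the bus, so any single unit will see less in practice.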
