HN Mirror

by smpanaro 1024 days ago

You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade-off and use fixed-size inputs [1], but the speedup over no caching is meaningful.
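To illustrate what fixed-size KV caching means, here is a minimal pure-Python sketch. It is not the code from either repo and it does not touch Core ML. The names `FixedKVCache`, `attend`, and `project` are hypothetical, and `project` is only a stand-in for a real layer's key/value projections. The cache is preallocated at a fixed length, each step writes one new key/value pair into the next slot, and unused slots are masked out, so every call sees inputs of the same shape.

```python
import math


def project(token_vec, shift):
    # Hypothetical stand-in for a layer's key/value projections.
    return [x * (shift + 1) + 0.1 * i for i, x in enumerate(token_vec)]


def attend(query, keys, values, valid):
    """Single-head attention for one query over the first `valid` slots."""
    scale = 1.0 / math.sqrt(len(query))
    scores = []
    for i, key in enumerate(keys):
        if i < valid:
            scores.append(scale * sum(q * k for q, k in zip(query, key)))
        else:
            scores.append(-math.inf)  # padding slot, masked out
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # masked slots get weight 0
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class FixedKVCache:
    """Preallocated cache: the buffers never change size between steps."""

    def __init__(self, max_len, dim):
        self.keys = [[0.0] * dim for _ in range(max_len)]
        self.values = [[0.0] * dim for _ in range(max_len)]
        self.length = 0

    def step(self, token_vec):
        if self.length >= len(self.keys):
            raise ValueError("cache is full")
        # Only the new token is projected; earlier entries are reused.
        self.keys[self.length] = project(token_vec, 0)
        self.values[self.length] = project(token_vec, 1)
        self.length += 1
        return attend(token_vec, self.keys, self.values, self.length)


def decode_with_cache(tokens, max_len):
    cache = FixedKVCache(max_len, len(tokens[0]))
    return [cache.step(tok) for tok in tokens]


def decode_without_cache(tokens):
    outputs = []
    for t, tok in enumerate(tokens):
        # Every step re-projects the entire prefix from scratch.
        keys = [project(p, 0) for p in tokens[: t + 1]]
        values = [project(p, 1) for p in tokens[: t + 1]]
        outputs.append(attend(tok, keys, values, t + 1))
    return outputs
```

Both decode functions produce the same outputs. The cached version does one projection per step instead of one per prefix token, at the cost of always carrying `max_len` slots, which is the trade-off described above.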

There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.

Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to Core ML this year will allow for larger (and maybe faster) models on both.

[1] Technically you should be able to use 1 input with an enumerated set of sizes but I haven't been able to get it to work on the Neural Engine. This would likely be even faster.

[2] https://github.com/wangchou/whisper.coreml/

[3] https://github.com/smpanaro/more-ane-transformers/

1 comment

cypress66 1024 days ago

>I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching).

That seems very slow compared to llama.cpp?

smpanaro 1023 days ago

Yeah, I believe it is. You trade speed for lower power draw and less CPU load. About 8 tokens/sec is usable, though.
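For reference, the tokens/sec figure follows directly from the per-token latencies quoted in the thread. A quick sketch of the conversion (the helper name `tokens_per_second` is just for illustration):

```python
def tokens_per_second(ms_per_token):
    # Throughput is the reciprocal of per-token latency.
    return 1000.0 / ms_per_token


cached = tokens_per_second(120)    # with KV caching, roughly 8.3 tokens/sec
uncached = tokens_per_second(450)  # without caching, roughly 2.2 tokens/sec
speedup = 450 / 120                # 3.75x from caching
```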