|
|
|
|
|
by smpanaro
977 days ago
|
|
You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade off and use fixed size inputs [1] but the speed up over no caching is meaningful. There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon. Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both. [1] Technically you should be able to use 1 input with an enumerated set of sizes but I haven't been able to get it to work on the Neural Engine. This would likely be even faster.
[2] https://github.com/wangchou/whisper.coreml/
[3] https://github.com/smpanaro/more-ane-transformers/ |
|
That seems very slow compared to llama cpp?