by emadm 841 days ago
Should look at MLX optimisations too. Stable LM 1.6B, which is about the same size and quantises to 4-bit really well, runs at 100 tok/s on an M2 Mac mini.

https://x.com/awnihannun/status/1750986911827832992?s=20

1 comment

I'm just about to ship an update to the iOS version of my offline LLM app that will replace its current 3B default model (RedPajama Chat) with Stable LM 1.6B. It works extremely well even when quantized. I initially wanted to ship it with TinyLlama Chat, but TinyLlama and its fine-tunes are quite subpar, and many of my beta testers complained that it's much worse than even the old 3B model. Then I found StableLM 2 Zephyr 1.6B. :)

https://imgur.com/a/Imd2l9o