by woadwarrior01 790 days ago
Is this news? I've got a nearly year-old app that supports over two dozen local LLMs, with support for using them with Siri and Shortcuts. I added support for Llama 3 8B the day after it came out, and also for Eric Hartford's new Llama 3 8B-based Dolphin model. All models in it are quantized with OmniQuant. On iOS, the 7B and 8B models are 3-bit quantized and the smaller models are 4-bit quantized. On the macOS version, all models are 4-bit OmniQuant quantized. 3-bit OmniQuant quantization is quite comparable in perplexity to the 4-bit RTN (round-to-nearest) quantization that all the llama.cpp-based apps use.
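
For anyone unfamiliar with RTN, here is a rough sketch of what plain round-to-nearest quantization does to a weight matrix at 3 and 4 bits. It is a generic per-group asymmetric scheme on random weights; the function names and the group size of 128 are assumptions for illustration. It is not llama.cpp's actual storage format and it does not implement OmniQuant.

```python
import numpy as np

def rtn_quantize(weights, bits, group_size=128):
    """Generic round-to-nearest quantization, one scale/offset per group.

    Illustrative only: group_size=128 is an assumed value.
    """
    w = weights.reshape(-1, group_size)
    levels = 2**bits - 1                      # highest integer code
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round((w - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min, shape):
    """Map integer codes back to approximate float weights."""
    return (q * scale + w_min).reshape(shape)

def rtn_error(weights, bits, group_size=128):
    """Mean squared reconstruction error after an RTN round trip."""
    q, scale, w_min = rtn_quantize(weights, bits, group_size)
    recon = dequantize(q, scale, w_min, weights.shape)
    return float(np.mean((weights - recon) ** 2))

# Random stand-in weights; real model weights are distributed differently.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
err3 = rtn_error(w, 3)
err4 = rtn_error(w, 4)
print(f"3-bit RTN MSE: {err3:.5f}")
print(f"4-bit RTN MSE: {err4:.5f}")
```

With naive rounding the 3-bit error is noticeably larger than the 4-bit error, which is the gap a calibrated method has to close for 3-bit to be comparable to 4-bit RTN.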

https://privatellm.app/

https://apps.apple.com/app/private-llm-local-ai-chatbot/id64...


Nice. What is battery life like under heavy use? I was reading a thread on the llama.cpp repo earlier where they were discussing whether it was possible (or attractive) to add neural engine support in some form.
With the bigger 7B and 8B models, battery life goes from over a day to a few hours on my iPhone 15 Pro.

The 8B model nominally works on 6GB phones, but it's quite slow on them. OTOH, it's very usable on iPhone 15 Pro/Pro Max devices and even better on M1/M2 iPads.
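
As a back-of-the-envelope check on why bit width matters on a 6GB phone, here is a minimal sketch of the weights-only footprint. It assumes exactly 8e9 parameters and ignores quantization scales, the KV cache, activations, and whatever memory the OS reserves, so the real numbers are higher. The helper name is hypothetical.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Weights-only size in GB (1 GB = 1e9 bytes); excludes all overhead."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9  # assumed round figure for an "8B" model

for bits in (3, 4, 16):
    size = weight_footprint_gb(N_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GB")
```

Under these assumptions, 3-bit weights come to about 3 GB and 4-bit weights to about 4 GB, which leaves very little headroom on a 6GB device.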

Every framework (llama.cpp, MLX, and mlc-llm, which I use) only uses the GPU. Using the ANE, and perhaps the undocumented AMX coprocessor, for efficient decoder-only transformer inference is still an open problem. I've made some early progress on quantised inference using the ANE, but there are still a lot of issues to be solved before it is even demo-ready, let alone a shipping product.

Super interesting, thank you!