by brittlewis12 892 days ago

TL;DR: No, nearly all these apps will use the GPU (via Metal) or the CPU, not the Neural Engine (ANE).

Why? I suggest three main reasons: 1) No Neural Engine API; 2) CoreML has challenges modeling LLMs efficiently right now; 3) Not Enough Benefit (For the Cost... Yet!)

This is my best understanding based on my own work and research for a local LLM iOS app. Read on for more in-depth justifications of each point!

---

1) No Neural Engine API

- There is no public developer API for targeting the Neural Engine directly, so CoreML is the only way to use it.

2) CoreML has challenges modeling LLMs efficiently right now.

- Its most-optimized use cases seem tailored for image models, as it works best with fixed input lengths[1][2], which are fairly limiting for general language modeling (are all prompts, sentences and paragraphs, the same number of tokens? do you want to pad all your inputs?).

- CoreML features limited support for the leading approaches for compressing LLMs (quantization, whether weights-only or activation-aware). Falcon-7b-instruct (fp32) in CoreML is 27.7GB [3], and Llama-2-7b-chat (fp16) is 13.5GB [4]; neither will fit in memory on any currently shipping iPhone. They'd only barely fit on the newest, highest-end iPad Pros.

- HuggingFace's swift-transformers[5] is a CoreML-focused library under active development that aims to eventually help developers with many of these problems. There is also an `exporters` CLI tool[6] that wraps Apple's `coremltools` for converting PyTorch or other models to CoreML.
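
A rough back-of-the-envelope check on the model sizes quoted above, assuming the file size is dominated by the weights (parameter count times bytes per weight). The parameter counts are approximate assumptions, not figures taken from the linked repos, and the 4-bit line is a hypothetical illustration of what weights-only quantization would save, not a statement about what CoreML supports:

```python
def weights_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Approximate parameter counts (assumptions for illustration only).
FALCON_7B_PARAMS = 6.92e9
LLAMA_2_7B_PARAMS = 6.74e9

print(f"Falcon-7b fp32:   {weights_gb(FALCON_7B_PARAMS, 32):.1f} GB")
print(f"Llama-2-7b fp16:  {weights_gb(LLAMA_2_7B_PARAMS, 16):.1f} GB")
# Hypothetical 4-bit weights-only quantization of the same model.
print(f"Llama-2-7b 4-bit: {weights_gb(LLAMA_2_7B_PARAMS, 4):.1f} GB")
```

Under these assumptions the fp32 and fp16 estimates land on the 27.7GB and 13.5GB figures above, while 4-bit weights would come to a few GB, which is why quantization support matters so much for phones.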

3) Not Enough Benefit (For the Cost... Yet!)

- ANE & GPU (Metal) have access to the same unified memory. They are both subject to the same restrictions on background execution (you simply can't use them in the background, or your app is killed[7]).

- So the main benefit from unlocking the ANE would be multitasking: running an ML task in parallel with non-ML tasks that might also require the GPU, e.g. SwiftUI Metal Shaders, background audio processing (shoutout Overcast!), screen recording/sharing, etc. That is absolutely worthwhile to achieve, but given the significant work required and the current lack of ecosystem around CoreML for LLMs specifically, the benefits become less clear.

- Apple's hot new ML library, MLX, only uses Metal for GPU[8], just like Llama.cpp. More nuanced differences arise on closer inspection related to MLX's focus on unified memory optimizations. So perhaps we can squeeze out some performance from unified memory in Llama.cpp, but CoreML will be the only way to unlock ANE, which is lower priority according to lead maintainer Georgi Gerganov as of late this past summer[9], likely for many of the reasons enumerated above.

I've learned most of this while working on my own private LLM inference app, cnvrs[10] — would love to hear your feedback or thoughts!

Britt

---

[1] https://github.com/huggingface/exporters/pull/37

[2] https://apple.github.io/coremltools/docs-guides/source/flexi...

[3] https://huggingface.co/tiiuae/falcon-7b-instruct/tree/main/c...

[4] https://huggingface.co/coreml-projects/Llama-2-7b-chat-corem...

[5] https://github.com/huggingface/swift-transformers

[6] https://github.com/huggingface/exporters

[7] https://developer.apple.com/documentation/metal/gpu_devices_...

[8] https://github.com/ml-explore/mlx/issues/18

[9] https://github.com/ggerganov/llama.cpp/issues/1714#issuecomm...

[10] https://testflight.apple.com/join/ERFxInZg

2 comments

joeconway 892 days ago

This is really interesting, thank you.

What would be the downside to padding all inputs to have consistent input token size?

brittlewis12 891 days ago

Conceptually, to the best of my understanding, nothing too serious; perhaps the inefficiency of processing a larger input than necessary?

Practically, a few things:

If you want to have your cake & eat it too, Apple recommends Enumerated Shapes[1] in the coremltools docs, where CoreML precompiles up to 128 (!) variants of input shapes. But again this is fairly limiting: 1 tok, 2 tok, 3 tok... up to 128-token prompts. Maybe you enforce a minimum, say 80 tokens to account for a system prompt, which gets you up to about 200 tokens, but that's still pretty short. On top of that, this is only compatible with CPU inference, which reduces its appeal.
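
To make the arithmetic concrete, here is a minimal sketch in plain Python (no coremltools calls) of how a budget of 128 enumerated sequence lengths could be spent. The helper names and the `stride` parameter are hypothetical; a coarser stride is just one possible way to trade extra padding for a longer maximum prompt, not something the docs prescribe:

```python
MAX_ENUMERATED_SHAPES = 128  # limit quoted from the coremltools docs in [1]


def enumerated_lengths(min_len, stride=1, count=MAX_ENUMERATED_SHAPES):
    """Sequence lengths to precompile: min_len, min_len + stride, ..."""
    if count > MAX_ENUMERATED_SHAPES:
        raise ValueError("more shapes than the documented limit")
    return [min_len + i * stride for i in range(count)]


def pick_length(n_tokens, lengths):
    """Smallest precompiled length that fits the prompt (lengths ascending)."""
    for length in lengths:
        if length >= n_tokens:
            return length
    raise ValueError(f"prompt of {n_tokens} tokens exceeds {lengths[-1]}")


per_token = enumerated_lengths(min_len=80)        # one shape per token count
coarse = enumerated_lengths(min_len=80, stride=16)  # hypothetical coarser grid

print(per_token[0], per_token[-1])  # shortest and longest per-token shapes
print(coarse[-1])                   # longest shape with a stride of 16
print(pick_length(100, coarse))     # a 100-token prompt is padded up to this
```

With one shape per token count and a minimum of 80, the longest prompt is 207 tokens, which matches the "about 200" estimate; the strided variant reaches much further at the cost of padding each prompt up to the next bucket.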

It seems like its current state was designed for text embedding models, where you normalize input length by chunking (often 128 or 256 tokens) and operate on the chunks. Indeed, that’s the only text-based CoreML model that Apple ships today: a BERT embedding model tuned for Q&A[2], not an LLM.
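
As a rough illustration of that chunking approach, here is a sketch that splits a token sequence into fixed-length chunks so every model call sees the same input shape. The function name, the pad id of 0, and the choice to pad only the final chunk are assumptions for illustration, not how any particular Apple model does it:

```python
def chunk_tokens(token_ids, chunk_len=128, pad_id=0):
    """Split token_ids into chunks of exactly chunk_len, padding the last one."""
    chunks = []
    for start in range(0, len(token_ids), chunk_len):
        chunk = token_ids[start:start + chunk_len]
        # Only the final chunk can be short; pad it to the fixed length.
        chunk = chunk + [pad_id] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks


# A 300-token document becomes three 128-token inputs.
chunks = chunk_tokens(list(range(1, 301)), chunk_len=128)
print(len(chunks), [len(c) for c in chunks])
```

This works for embeddings because each chunk can be processed independently; a generative LLM needs the whole prompt in context at once, which is where the fixed-shape constraint starts to hurt.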

You could use a fixed input length that’s fairly large. I stopped short of experimenting with it once I grasped the memory requirements, but from what I gather from HuggingFace’s announcement blog post[3], that seems to be what they do with swift-transformers & their CoreML conversions, handling the details for you[4][5]. I haven’t carefully investigated the implementation, but I’m curious to learn more!
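
For a sense of what padding to a fixed length costs, here is a minimal sketch, not the swift-transformers or exporters implementation. The helper names and the pad id of 0 are hypothetical; the default of 128 mirrors the hardcoded value mentioned in [4]:

```python
def pad_to_fixed(token_ids, fixed_len=128, pad_id=0):
    """Right-pad token_ids to fixed_len and build a matching attention mask."""
    if len(token_ids) > fixed_len:
        raise ValueError("prompt is longer than the fixed input length")
    n_pad = fixed_len - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad  # 0 marks padding
    return input_ids, attention_mask


def padding_fraction(attention_mask):
    """Share of the input positions that carry no real tokens."""
    return attention_mask.count(0) / len(attention_mask)


ids, mask = pad_to_fixed([5, 6, 7], fixed_len=8)
print(ids)
print(mask)
print(padding_fraction(mask))
```

The model still processes every position, so the padding fraction is roughly the share of work spent on tokens that carry no information; the larger the fixed length relative to typical prompts, the more of each forward pass is wasted.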

You can be sure that no one is more aware of all this than Apple — they published "Deploying Transformers on the Apple Neural Engine" in June 2022[6]. I look forward to seeing what they cook up for developers at WWDC this year!

---

[1] "Use `EnumeratedShapes` for best performance. During compilation the model can be optimized on the device for the finite set of input shapes. You can provide up to 128 different shapes." https://apple.github.io/coremltools/docs-guides/source/flexi...

[2] BertSQUAD.mlmodel (fp16) https://developer.apple.com/machine-learning/models/#text

[3] https://huggingface.co/blog/swift-coreml-llm#optimization

[4] `use_fixed_shapes` "Retrieve the max sequence length from the model configuration, or use a hardcoded value (currently 128). This can be subclassed to support custom lengths." https://github.com/huggingface/exporters/pull/37/files#diff-...

[5] `use_flexible_shapes` "When True, inputs are allowed to use sequence lengths of `1` up to `maxSequenceLength`. Unfortunately, this currently prevents the model from running on GPU or the Neural Engine. We default to `False`, but this can be overridden in custom configurations." https://github.com/huggingface/exporters/pull/37/files#diff-...

[6] https://machinelearning.apple.com/research/neural-engine-tra...

swyx 891 days ago

great high effort answer, thanks so much!

to prod you to sell yourself a bit more - what is the goal/selling point of cnvrs?

brittlewis12 891 days ago

Oh man I’m a big fan, swyx!! Latent Space & AI.engineer are fantastic resources to the community. Thank you for the kind words & the prompt!

It’s still early days, but at a high level, I have a few goals:

- expand accessibility and increase awareness of the power & viability of small models (the scene can be quite impenetrable for many!);

- provide an easy-to-use, attractive, efficient app that’s a good platform citizen, taking full advantage of Apple’s powerful device capabilities;

- empower more people to protect their private conversation data, which has material value to large AI companies;

- incentivize more experimentation, training & fine-tuning efforts focused on small, privately-runnable models.

I’d love to one day become your habitual ChatGPT alternative, as high a bar as that may be.

I have some exciting ideas: enabling a user-generated public gallery of characters; expanding into multimodal use cases, like images & speech; composing larger workflows on top of LLMs, similar to Shortcuts; grounding open models against web search indices for factuality; and, further out, more speculative ideas, including exposing JavaScriptCore to models as a tool, like Python in ChatGPT’s code interpreter.

But I’m sure you’ve also given a lot of thought to the future of AI on device with smol — what are some dreams you have for truly private AI that’s always with you?

swyx 891 days ago

i dont dream of truly private ai like that haha. im a pretty open book. but very very glad to see more options in the local ai space!
