| TL;DR: No, nearly all of these apps use the GPU (via Metal) or the CPU, not the Neural Engine (ANE). Why? I suggest a few main reasons:
1) No Neural Engine API
2) CoreML has challenges modeling LLMs efficiently right now.
3) Not Enough Benefit (For the Cost... Yet!)

This is my best understanding based on my own work and research for a local LLM iOS app. Read on for more in-depth justifications of each point!

---

1) No Neural Engine API
- There is no developer API for using the Neural Engine programmatically, so CoreML is the only way to use it.

2) CoreML has challenges modeling LLMs efficiently right now.
- Its most-optimized use cases seem tailored for image models, as it works best with fixed input lengths[1][2], which are fairly limiting for general language modeling (are all prompts, sentences, and paragraphs the same number of tokens? Do you want to pad all your inputs?).
- CoreML has limited support for the leading approaches to compressing LLMs (quantization, whether weights-only or activation-aware). Falcon-7b-instruct (fp32) in CoreML is 27.7GB[3] and Llama-2-chat (fp16) is 13.5GB[4]. Neither will fit in memory on any currently shipping iPhone; they'd only barely fit on the newest, highest-end iPad Pros.
- Hugging Face's swift-transformers[5] is a CoreML-focused library under active development that should eventually help developers with many of these problems, alongside an `exporters` CLI tool[6] that wraps Apple's `coremltools` for converting PyTorch or other models to CoreML.

3) Not Enough Benefit (For the Cost... Yet!)
- ANE & GPU (Metal) have access to the same unified memory. They are both subject to the same restrictions on background execution (you simply can't use them in the background, or your app is killed[7]).
- So the main benefit from unlocking the ANE would be multitasking: running an ML task in parallel with non-ML tasks that might also require the GPU, e.g. SwiftUI Metal Shaders, background audio processing (shoutout Overcast!), screen recording/sharing, etc. Absolutely worthwhile to achieve, but given the significant work required and the current lack of ecosystem around CoreML for LLMs specifically, the benefits become less clear.
- Apple's hot new ML library, MLX, only uses Metal for GPU[8], just like llama.cpp. More nuanced differences arise on closer inspection, related to MLX's focus on unified memory optimizations. So perhaps we can squeeze out some performance from unified memory in llama.cpp, but CoreML will be the only way to unlock the ANE, which is lower priority according to lead maintainer Georgi Gerganov as of late this past summer[9], likely for many of the reasons enumerated above.

I've learned most of this while working on my own private LLM inference app, cnvrs[10]. Would love to hear your feedback or thoughts!

Britt

---

[1] https://github.com/huggingface/exporters/pull/37
[2] https://apple.github.io/coremltools/docs-guides/source/flexi...
[3] https://huggingface.co/tiiuae/falcon-7b-instruct/tree/main/c...
[4] https://huggingface.co/coreml-projects/Llama-2-7b-chat-corem...
[5] https://github.com/huggingface/swift-transformers
[6] https://github.com/huggingface/exporters
[7] https://developer.apple.com/documentation/metal/gpu_devices_...
[8] https://github.com/ml-explore/mlx/issues/18
[9] https://github.com/ggerganov/llama.cpp/issues/1714#issuecomm...
[10] https://testflight.apple.com/join/ERFxInZg |
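For a sense of why the quantization point in 2) matters so much: the sizes quoted there are roughly parameter count times bytes per weight. A back-of-the-envelope sketch in Python, assuming a round 7 billion parameters (an illustrative figure, not the exact count for either model) and decimal gigabytes, and counting weights only (no KV cache or activations):

```python
GB = 1e9  # decimal gigabytes (assumption; model pages may round differently)

def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, ignoring KV cache and activations."""
    return n_params * bits_per_weight / 8 / GB

# Round illustrative parameter count, not the exact size of Falcon-7B or Llama-2-7B.
N_PARAMS_7B = 7e9

if __name__ == "__main__":
    for label, bits in [("fp32", 32), ("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_size_gb(N_PARAMS_7B, bits):5.1f} GB")
```

That works out to about 28 GB at fp32 and 14 GB at fp16, in the same ballpark as the 27.7GB and 13.5GB figures above, versus about 3.5 GB at 4 bits per weight.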
What would be the downside to padding all inputs to have consistent input token size?
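One rough way to look at the trade-off: with a fixed input length, every prompt costs as much as the longest one you allow, and anything longer gets cut off. Here is a minimal Python sketch of that accounting. The prompt lengths and the 512/64 window sizes are made-up numbers for illustration, not measurements, and `padding_cost` is a hypothetical helper, not part of CoreML or any library:

```python
def padding_cost(prompt_lengths, fixed_len):
    """Account for padding every prompt to fixed_len tokens.

    Returns (real_tokens, processed_tokens, truncated_tokens, wasted_fraction).
    Prompts longer than fixed_len are truncated to fit.
    """
    real = sum(min(n, fixed_len) for n in prompt_lengths)
    processed = fixed_len * len(prompt_lengths)
    truncated = sum(max(n - fixed_len, 0) for n in prompt_lengths)
    wasted = 1 - real / processed
    return real, processed, truncated, wasted

if __name__ == "__main__":
    lengths = [12, 40, 7, 300, 95]  # hypothetical prompt lengths, in tokens
    for fixed_len in (512, 64):
        real, processed, truncated, wasted = padding_cost(lengths, fixed_len)
        print(f"fixed_len={fixed_len}: {real} real tokens, {processed} processed, "
              f"{truncated} truncated, {wasted:.0%} padding")
```

With these numbers, a 512-token window means about 82% of the processed positions are padding, while a 64-token window cuts the padding to about 42% but drops 267 tokens from the two longer prompts. Generation makes this worse, since the sequence grows by one token per step, so the fixed length has to cover the prompt plus everything you intend to generate.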