

by Hyzer 1160 days ago

Are there any other projects/libraries that can run Llama models on Apple Silicon GPU? This is the first one I've seen.

Comparing it to llama.cpp on my M1 Max 32GB, it seems at least as fast just by eyeballing it. Not sure if the inference speed numbers can be compared directly.

vicuna-7b-v0 on Chrome Canary with the disable-robustness flag: encoding: 74.4460 tokens/sec, decoding: 18.0679 tokens/sec = 55.3 ms per token

llama.cpp: $ ./main -m models/7B/ggml-model-q4_0-ggjt.bin -t 8 --ignore-eos = 45 ms per token
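To put the two sets of numbers on the same scale, tokens/sec and ms per token are just reciprocals of each other. A rough sketch of the conversion (the function names are mine, not from either project, and this says nothing about whether the two tools measure the same thing):

```python
def ms_per_token(tokens_per_sec: float) -> float:
    """Convert a throughput in tokens/sec to a latency in ms per token."""
    return 1000.0 / tokens_per_sec


def tokens_per_sec(ms_per_tok: float) -> float:
    """Convert a latency in ms per token to a throughput in tokens/sec."""
    return 1000.0 / ms_per_tok


if __name__ == "__main__":
    # Decoding rate reported by the browser run, in tokens/sec
    print(f"{ms_per_token(18.0679):.1f} ms per token")
    # Per-token time reported by llama.cpp, in ms
    print(f"{tokens_per_sec(45):.1f} tokens/sec")
```

Even on the same scale, the two runs may use different quantization, prompt lengths, and timing boundaries, so I'd treat this as a ballpark comparison only.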