Hacker News
by Gracana 635 days ago
Token generation (decode) is memory-bandwidth-bound, while prompt processing (prefill, which fills the KV cache) is compute-bound. The ARM Macs have lots of memory bandwidth but comparatively little compute power, so they're great at outputting text but slow at tasks with long prompts, like analyzing documents. I've never done fine-tuning, but my understanding is that it is a highly parallelizable compute hog as well.

You might like this article, which looks at the arithmetic intensity of LLM processing: https://www.baseten.co/blog/llm-transformer-inference-guide/
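
For a rough feel of why generation and prompt processing land on opposite sides, here is a back-of-the-envelope sketch in Python. It assumes about 2 FLOPs per parameter per token, that the weights are read from memory once per forward pass, and it ignores KV cache and activation traffic. The function names and the hardware figures (20 TFLOPS, 400 GB/s) are made-up placeholders for illustration, not measurements of any particular Mac.

```python
def arithmetic_intensity(n_tokens: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, weights read once per pass,
    KV cache and activation traffic ignored. The parameter count cancels out.
    """
    return 2.0 * n_tokens / bytes_per_param


def bound(n_tokens: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a forward pass with a simple roofline comparison."""
    # Ridge point: the intensity at which compute and memory take equal time.
    ridge = peak_flops / mem_bandwidth
    ai = arithmetic_intensity(n_tokens, bytes_per_param)
    return "compute-bound" if ai >= ridge else "memory-bound"


# Hypothetical hardware: 20 TFLOPS of compute, 400 GB/s of memory bandwidth.
PEAK_FLOPS = 20e12
MEM_BANDWIDTH = 400e9

decode = bound(1, PEAK_FLOPS, MEM_BANDWIDTH)      # one new token per pass
prefill = bound(2048, PEAK_FLOPS, MEM_BANDWIDTH)  # whole prompt in one pass
print(decode, prefill)
```

With fp16 weights, generating one token does roughly 1 FLOP per byte read, far below the ridge point of 50 FLOPs per byte for these placeholder numbers, while a 2048-token prompt reuses each weight across every token and ends up well above it.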