Hacker News
by ra120271 634 days ago
For my education, would you be able to expand on what it doesn't work for and why? Thank you!
1 comment

Token generation (the decode phase of inference) is memory-bandwidth-bound, while prompt processing (prefill, which fills the KV cache) is compute-bound. The ARM Macs have lots of memory bandwidth but not a lot of compute power, so they're great for outputting text but slow at tasks with long prompts, like analyzing documents. I've never done fine-tuning, but my understanding is that it's a highly parallelizable compute hog as well.
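To make the memory-bound vs. compute-bound split concrete, here's a rough back-of-the-envelope sketch. It assumes a dense model where each weight costs about 2 FLOPs per token (one multiply, one add) and is read from memory once per forward pass, and it ignores attention, KV cache reads, and activations. The hardware numbers are made-up placeholders, not specs for any real Mac, and the function names are mine.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per weight per token and that every weight is read
    from memory exactly once per pass (a simplification).
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Simple roofline check: compare intensity to the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can load
    return "compute" if intensity > ridge else "memory"


# Hypothetical hardware, for illustration only (not measured figures).
PEAK_FLOPS = 20e12       # 20 TFLOP/s
MEM_BANDWIDTH = 400e9    # 400 GB/s  -> ridge point of 50 FLOPs/byte

# Token generation: one token per pass, 16-bit weights (2 bytes each).
decode = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2.0)
# Prompt processing: a 2048-token prompt handled in one pass.
prefill = arithmetic_intensity(tokens_per_pass=2048, bytes_per_param=2.0)

print("decode :", decode, "FLOPs/byte ->", bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH))
print("prefill:", prefill, "FLOPs/byte ->", bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, generating one token at a time does about 1 FLOP per byte loaded, far below the ridge point, so memory bandwidth is the limit. A long prompt reuses each loaded weight across every prompt token, which pushes the intensity well past the ridge point and makes raw compute the limit.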

You might like this article, which looks at the arithmetic intensity of LLM processing: https://www.baseten.co/blog/llm-transformer-inference-guide/