| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Gracana 635 days ago
	Inference (token generation) is memory-bound, KV cache prefill (prompt processing) is compute-bound. The ARM Macintoshes have lots of memory bandwidth but not a lot of compute power, so they're great for outputting text but terrible for tasks like analyzing documents. I've never done fine-tuning but my understanding is that that is a highly-parallelizable compute hog as well. You might like this article, which looks at the arithmetic intensity of LLM processing: https://www.baseten.co/blog/llm-transformer-inference-guide/