| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by zozbot234 314 days ago
	Prompt pre-processing (before the first token is output) is raw compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on iGPU/NPU (for systems without a separate dGPU, obviously) and shift the whole thing over to CPU inference for the latter token-generation phase. (A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)