
If you're running an LLM, there's a benefit in shifting prompt processing (prefill) to the NPU. More generally, anything that's limited by memory throughput should stay on the GPU, while the NPU can help with compute-limited tasks, at least to some extent.
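To make the compute-limited vs. memory-limited split concrete, here's a rough roofline-style sketch. The hardware figures (40 TFLOPS, 100 GB/s) and the function names are made up for illustration, and it only counts weight traffic, ignoring the KV cache and activations:

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes each weight does one multiply-add (2 FLOPs) per token
    and is read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


def bound(tokens_per_pass: int, bytes_per_weight: float,
          peak_flops: float, mem_bandwidth: float) -> str:
    """Return "compute" or "memory" depending on which limit is hit first."""
    # Ridge point: FLOPs per byte above which compute is the bottleneck.
    ridge = peak_flops / mem_bandwidth
    ai = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return "compute" if ai >= ridge else "memory"


# Hypothetical device: 40 TFLOPS peak, 100 GB/s memory bandwidth, 4-bit weights.
PEAK_FLOPS = 40e12
BANDWIDTH = 100e9

print(bound(512, 0.5, PEAK_FLOPS, BANDWIDTH))  # prompt processing, 512 tokens at once
print(bound(1, 0.5, PEAK_FLOPS, BANDWIDTH))    # token generation, 1 token at a time
```

With these assumed numbers, processing a 512-token prompt comes out compute-bound and single-token generation comes out memory-bound, which is the split described above.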

The general problem with NPUs on memory-limited tasks is one of two things: either the memory throughput available to them is too low to begin with, or they're constrained to data formats that force wasteful padding or dequantizing when the weights are read (at least for newer models), whereas a GPU just does that in local registers.
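A back-of-the-envelope sketch of why the format constraint hurts, under one reading of the above: if the weights have to sit in memory in a wider format than the model was quantized to, every generated token drags more bytes over the bus. The model size, bit widths, and bandwidth below are assumptions for illustration, and `decode_tokens_per_second` is a hypothetical helper:

```python
def decode_tokens_per_second(n_params: float, stored_bits: int,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed when every weight is read once per token."""
    weight_bytes = n_params * stored_bits / 8
    return bandwidth_bytes_per_s / weight_bytes


N_PARAMS = 7e9     # assumed 7B-parameter model
BANDWIDTH = 100e9  # assumed 100 GB/s available to the device

# 4-bit weights read as-is and unpacked in registers.
packed = decode_tokens_per_second(N_PARAMS, 4, BANDWIDTH)

# Same weights padded out to 8 bits in memory because the device
# can't consume the 4-bit format directly.
padded = decode_tokens_per_second(N_PARAMS, 8, BANDWIDTH)

print(f"packed 4-bit: {packed:.1f} tok/s, padded to 8-bit: {padded:.1f} tok/s")
```

Under these assumptions, padding 4-bit weights to 8 bits halves the bandwidth-bound ceiling on tokens per second, before any other overhead.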