Hacker News new | ask | show | jobs
by wtallis 161 days ago
If you ask NVIDIA, inference should always run on the GPU. If you ask anybody else designing chips for consumer devices, they say there's a benefit to having a low-power NPU that's separate from the GPU.
1 comments

Okay, yeah, and those manufacturers’ opinions are both obvious reflections of market position independent of the merits, what do people who actually run inference say?

(Also, the NPUs usually aren't any more separate from the GPU than tensor cores are separate from an Nvidia GPU, they are integrated with the CPU and iGPU.)

If you're running an LLM there's a benefit in shifting prompt pre-processing to the NPU. More generally, anything that's memory-throughput limited should stay on the GPU, while the NPU can aid compute-limited tasks to at least some extent.

The general problem with NPUs for memory-limited tasks is either that the throughput available to them is too low to begin with, or that they're usually constrained to formats that will require wasteful padding/dequantizing when read (at least for newer models) whereas a GPU just does that in local registers.

Depends on how big the NPU is and how much power/memory the inference model needs.