| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by zozbot234 602 days ago

> For some people, they lack the resources to run an LLM on a GPU.

Most people have a usable iGPU, that's going to run most models significantly slower (because less available memory throughput, and/or more of it being wasted on padding, compared to CPU) but a lot cooler than the CPU. NPU's will likely be a similar story.

It would be nice if there was an easy way to only run the initial prompt+context processing (which is generally compute bound) on iGPU+NPU, but move to CPU for the token generation stage.