|
|
|
|
|
by zozbot234
555 days ago
|
|
> For some people, they lack the resources to run an LLM on a GPU. Most people have a usable iGPU, that's going to run most models significantly slower (because less available memory throughput, and/or more of it being wasted on padding, compared to CPU) but a lot cooler than the CPU. NPU's will likely be a similar story. It would be nice if there was an easy way to only run the initial prompt+context processing (which is generally compute bound) on iGPU+NPU, but move to CPU for the token generation stage. |
|