by talldayo
535 days ago
Oftentimes they do. If they don't, it's not very hard to page memory to and from the NPU until the operation is completed. The bigger problem is that this NPU hardware isn't built to scale to larger models. It's laser-focused on dense computation and low-precision inference, which usually isn't much more efficient than running the same matmul as a compute shader. For Whisper-scale models that don't require insanely high precision or super sparse decoding, NPU hardware can work great. For LLMs, it is almost always going to be slower than a well-tuned GPU.
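To make the paging point concrete, here's a rough sketch of a matmul that only ever holds one small block of the weight matrix at a time, the way you'd stream tiles through a limited on-chip buffer. This is plain Python and not any real NPU API; `matmul_paged`, `matmul_reference`, and the `tile` parameter are hypothetical names made up for illustration.

```python
# Hypothetical sketch: names and sizes are illustrative, not a real NPU API.

def matmul_paged(a, b, tile):
    """Multiply a (m x k) by b (k x n), staging one tile x tile block of b at a time.

    Returns (product, pages), where pages counts how many blocks of b were staged.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    pages = 0
    for k0 in range(0, k, tile):
        for n0 in range(0, n, tile):
            # "Page in": copy one block of b into the small local buffer.
            block = [row[n0:n0 + tile] for row in b[k0:k0 + tile]]
            pages += 1
            # Compute using only the staged block, accumulating partial sums.
            for i in range(m):
                for kk, brow in enumerate(block):
                    aik = a[i][k0 + kk]
                    for jj, bval in enumerate(brow):
                        out[i][n0 + jj] += aik * bval
            # The block is dropped here before the next one is staged.
    return out, pages


def matmul_reference(a, b):
    """Unblocked matmul, used only to check the paged version."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

The result matches the unblocked version; what changes is the number of transfers, which grows as the buffer shrinks relative to the model. That transfer count is the cost that starts to dominate at LLM scale.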