by zozbot234
167 days ago
Prompt processing could be sped up with NPU inference. The Strix Halo NPU is a bit unusual (it's XDNA 2, a spatial dataflow architecture with programmable interconnects), but it's there. See https://github.com/FastFlowLM/FastFlowLM (directly supported by https://lemonade-server.ai/ and https://github.com/lemonade-sdk/lemonade ) for one existing project that's planning to use the NPU for the prompt processing phase. (Note that FLM provides its NPU kernels under a proprietary, non-free license, so make sure that fits your needs before using it.)
|
AMD’s own marketing numbers say the NPU is about 50 TOPS out of 126 TOPS total compute for the platform. That leaves about 76 TOPS for everything else, so even if you hand-wave everything else away, adding the NPU caps the theoretical upside at 126/76, or roughly 1.66x.
But that assumes:
1. Your workload maps cleanly onto the NPU’s 8-bit fast path.
2. There’s no overhead coordinating the iGPU + NPU.
My expectation is the real-world gain would be close to 0, but I'd love to be proven wrong!
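For anyone who wants to play with the numbers, here is a rough back-of-the-envelope sketch of that ceiling. It assumes the non-NPU 76 TOPS is the baseline you are already using and that throughput scales linearly with TOPS, which is generous. The `npu_efficiency` and `overhead_fraction` knobs are hypothetical stand-ins for the two assumptions above, not measured values.

```python
def max_speedup(total_tops, npu_tops, npu_efficiency=1.0, overhead_fraction=0.0):
    """Upper bound on speedup from adding the NPU to the rest of the platform.

    npu_efficiency: fraction of the NPU's peak TOPS the workload can actually
        use (1.0 means it maps perfectly onto the 8-bit fast path).
    overhead_fraction: fraction of combined throughput lost to coordinating
        the iGPU and NPU (0.0 means coordination is free).
    Both parameters are illustrative assumptions, not vendor figures.
    """
    baseline = total_tops - npu_tops              # compute available without the NPU
    combined = baseline + npu_tops * npu_efficiency
    return (combined / baseline) * (1.0 - overhead_fraction)


# Ideal case using the marketing numbers quoted above.
print(f"{max_speedup(126, 50):.2f}x")             # 1.66x

# A made-up pessimistic case: 30% usable NPU throughput, 10% coordination loss.
print(f"{max_speedup(126, 50, 0.3, 0.1):.2f}x")   # 1.08x
```

With even modest losses on either assumption, the bound drops quickly toward 1x, which is why the real-world gain could end up near zero.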