by pclmulqdq
609 days ago
The correct way to make a true "NPU" is to 10x your memory bandwidth and feed that to a regular old multicore CPU with SIMD/vector instructions (and maybe a matrix multiply unit). Most of these small NPUs are actually built for CNNs and other models where "stream data through weights" applies: the weights are small enough to stay put and get reused across lots of input data. They get a huge speedup there. When you instead have to stream the weights across the data (any LLM or other large model), you are almost certain to be bound by memory bandwidth, not compute.
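A rough back-of-the-envelope sketch of that argument, in roofline terms: compare each workload's FLOPs per byte moved against the machine's FLOPs per byte of bandwidth. Everything here is an assumption for illustration, not a measurement: the function names, the fp16 (2-byte) values, the layer shapes, and the made-up chip with 10 TFLOP/s of compute and 100 GB/s of memory bandwidth.

```python
def matvec_intensity(rows, cols, bytes_per_weight=2):
    """FLOPs per byte for y = W @ x (batch-1 LLM decode, roughly).

    Every weight is read from memory once and used for one
    multiply-add, so the weights dominate traffic. The small
    input/output vectors are ignored.
    """
    flops = 2 * rows * cols                      # one multiply + one add per weight
    bytes_moved = rows * cols * bytes_per_weight
    return flops / bytes_moved


def conv_intensity(h, w, c_in, c_out, k, bytes_per_value=2):
    """FLOPs per byte for a k x k conv layer, stride 1, same-size output.

    The small weight tensor is reused at every one of the h*w output
    positions, so it is loaded once and data streams through it.
    """
    flops = 2 * h * w * c_in * c_out * k * k
    weight_bytes = k * k * c_in * c_out * bytes_per_value
    activation_bytes = (h * w * c_in + h * w * c_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def is_bandwidth_bound(intensity, peak_flops, bandwidth_bytes_per_s):
    """True if the workload cannot keep the compute units busy."""
    machine_balance = peak_flops / bandwidth_bytes_per_s  # FLOPs per byte
    return intensity < machine_balance


# Hypothetical small NPU: 10 TFLOP/s, 100 GB/s -> balance of 100 FLOPs/byte.
PEAK_FLOPS = 10e12
BANDWIDTH = 100e9

llm_layer = matvec_intensity(4096, 4096)        # 1.0 FLOP/byte
cnn_layer = conv_intensity(56, 56, 64, 64, 3)   # a few hundred FLOPs/byte

print("LLM matvec bandwidth-bound:", is_bandwidth_bound(llm_layer, PEAK_FLOPS, BANDWIDTH))
print("CNN conv bandwidth-bound:  ", is_bandwidth_bound(cnn_layer, PEAK_FLOPS, BANDWIDTH))
```

With these assumed numbers the matrix-vector case sits at about 1 FLOP per byte, far below what the chip needs to stay busy, while the conv layer lands well above it. Adding more compute only helps the second case; the first only moves when the bandwidth does.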