by pclmulqdq 609 days ago
The correct way to make a true "NPU" is to 10x your memory bandwidth and feed a regular old multicore CPU with SIMD/vector instructions (and maybe a matrix multiply unit).

Most of these small NPUs are actually made for CNNs and other models where "stream data through weights" applies. They have a huge speedup there. When you stream weights across data (any LLM or other large model), you are almost certain to be bound by memory bandwidth.
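
A rough back-of-envelope sketch of the bandwidth argument, assuming batch-1 decoding where every weight is read once per token. The function names and all the numbers (7B parameters, int8 weights, 100 GB/s, 10 TOPS) are hypothetical, chosen only to illustrate the point, and do not describe any particular chip.

```python
def matvec_arithmetic_intensity(rows, cols, bytes_per_weight):
    """Operations per byte of weight traffic for y = W @ x, weights read once."""
    ops = 2 * rows * cols                      # one multiply + one add per weight
    weight_bytes = rows * cols * bytes_per_weight
    return ops / weight_bytes


def max_tokens_per_second(model_bytes, mem_bandwidth_bytes_per_s):
    """Upper bound on decode speed if every weight is streamed once per token."""
    return mem_bandwidth_bytes_per_s / model_bytes


def is_bandwidth_bound(intensity, peak_ops_per_s, mem_bandwidth_bytes_per_s):
    """Roofline check: below the ridge point, memory is the bottleneck."""
    ridge_point = peak_ops_per_s / mem_bandwidth_bytes_per_s
    return intensity < ridge_point


# Hypothetical numbers, for illustration only.
MODEL_BYTES = 7e9 * 1      # 7B parameters at 1 byte each (int8)
BANDWIDTH = 100e9          # 100 GB/s of memory bandwidth
PEAK_OPS = 10e12           # a "10 TOPS" accelerator

intensity = matvec_arithmetic_intensity(4096, 4096, bytes_per_weight=1)
tokens_per_s = max_tokens_per_second(MODEL_BYTES, BANDWIDTH)
bound = is_bandwidth_bound(intensity, PEAK_OPS, BANDWIDTH)
```

With these assumed figures the matvec does 2 ops per byte of weights, while the accelerator would need 100 ops per byte to stay busy, so adding compute changes nothing and the token rate is capped by bandwidth divided by model size.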


Apple Silicon is, surprisingly, a good approach here:

   * On CPU: SIMD NEON
   * On CPU: custom matrix multiply accelerator, separate from SIMD unit
   * On CPU package: NPU
   * GPU
Then they go and hide it all in proprietary undocumented features and force you to use their framework to access it :c
I’m sure we’ll get GPNPU. Low precision matvecs could be fun to play with.
SHAVE from Movidius was fun, before Intel bought them out.
Did they become un-fun? There are a bunch on the new Intel CPUs.
Most of the toolchain got hidden behind OpenVINO, and no hardware was released for years. Keem Bay was 'next year' for years. I have some DSP code written for it that I can't use anymore. Has Intel actually released new SHAVE cores with an actual dev environment? I'm curious.
The politics behind the software issues are complex. At least from the public presentation, the new SHAVE cores are not much changed besides bigger vector units. I don't know what it would take to make a lower-level SDK available again, but it sure seems like it would be useful.