| HN Mirror



	by pclmulqdq 609 days ago
	The correct way to make a true "NPU" is to 10x your memory bandwidth and feed a regular old multicore CPU with SIMD/vector instructions (and maybe a matrix multiply unit). Most of these small NPUs are actually made for CNNs and other models where "stream data through weights" applies. They have a huge speedup there. When you stream weights across data (any LLM or other large model), you are almost certain to be bound by memory bandwidth.
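A rough roofline-style calculation can illustrate the bandwidth argument above. This is only a sketch: the hardware figures (`PEAK_FLOPS`, `BANDWIDTH`), the 2-byte weight size, and the helper names are hypothetical values chosen for illustration, not measurements of any real NPU, and traffic for activations is ignored.

```python
def weight_streaming_intensity(batch: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read from memory.

    Each weight is fetched once and used for one multiply-add (2 FLOPs)
    per input in the batch. Activation traffic is ignored for simplicity.
    """
    return 2.0 * batch / bytes_per_weight


def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: the lower of the compute limit and the memory limit."""
    return min(peak_flops, bandwidth * intensity)


# Hypothetical hardware numbers, for illustration only.
PEAK_FLOPS = 10e12   # 10 TFLOP/s of compute
BANDWIDTH = 100e9    # 100 GB/s of memory bandwidth

# One input at a time (weights streamed across data, as in LLM decoding):
# every weight read buys only 2 FLOPs, so memory bandwidth sets the ceiling.
single = attainable_flops(PEAK_FLOPS, BANDWIDTH, weight_streaming_intensity(1))

# Many inputs reuse each weight (data streamed through weights, CNN-like):
# intensity rises with the batch until the compute units are the limit.
batched = attainable_flops(PEAK_FLOPS, BANDWIDTH, weight_streaming_intensity(256))

# Ten times the bandwidth lifts the single-input case tenfold.
single_10x = attainable_flops(PEAK_FLOPS, 10 * BANDWIDTH, weight_streaming_intensity(1))

print(f"single input:      {single / PEAK_FLOPS:.0%} of peak")
print(f"batch of 256:      {batched / PEAK_FLOPS:.0%} of peak")
print(f"single, 10x bw:    {single_10x / PEAK_FLOPS:.0%} of peak")
```

Under these assumed numbers the single-input case is limited by memory rather than by compute, which is why adding bandwidth helps there while adding more multiply units would not.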

2 comments

sounds 609 days ago

Apple Silicon is a surprisingly good approach here:

   * On CPU: SIMD NEON
   * On CPU: custom matrix multiply accelerator, separate from SIMD unit
   * On CPU package: NPU
   * GPU

Then they go and hide it all in proprietary undocumented features and force you to use their framework to access it :c


bee_rider 609 days ago

I’m sure we’ll get GPNPU. Low precision matvecs could be fun to play with.


touisteur 609 days ago

SHAVE from Movidius was fun, before Intel bought them out.


hedgehog 609 days ago

Did they become un-fun? There are a bunch on the new Intel CPUs.


touisteur 609 days ago

Most of the toolchain got hidden behind OpenVINO, and no hardware was released for years. Keem Bay was 'next year' for years. I have some DSP code written for it that I can't use anymore. Has Intel actually released new SHAVE cores with an actual dev environment? I'm curious.


hedgehog 609 days ago

The politics behind the software issues are complex. At least judging from the public presentation, the new SHAVE cores are not much changed apart from bigger vector units. I don't know what it would take to make a lower-level SDK available again, but it sure seems like it would be useful.
