by Chordless 595 days ago
This article is AI-generated, and they didn't even fact-check it. An AI module like this can help a lot with certain types of neural networks, but LLMs are not among them.

LLM inference is basically bottlenecked by RAM bandwidth and by how much RAM you have. Every generated token requires a pass over the whole model, pulling it piece by piece from RAM to the CPU, where some relatively small calculations are applied.

Having a separate NPU like this connected via PCIe makes LLMs much slower, since you're bottlenecked by a PCIe 3.0 x1 link (roughly 1 GB/s) instead of your full memory bandwidth.
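
To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch. It assumes every generated token has to read all of the weights once, and that in the NPU case those weights have to cross the PCIe link each time. The 4 GB model size (roughly a 7B model at 4-bit quantization) and the dual-channel DDR4-3200 system are my own illustrative assumptions, not figures from the article.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token reads every weight once."""
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9

# Assumption: ~7B parameters quantized to ~4 bits, about 4 GB of weights.
MODEL_BYTES = 4 * GB

# Assumption: dual-channel DDR4-3200.
# 2 channels x 8 bytes per transfer x 3200 million transfers/s = 51.2 GB/s peak.
RAM_BANDWIDTH = 2 * 8 * 3200e6

# PCIe 3.0 x1: 8 GT/s with 128b/130b encoding, about 0.985 GB/s before protocol overhead.
PCIE3_X1_BANDWIDTH = 8e9 * (128 / 130) / 8

ram_tps = max_tokens_per_second(MODEL_BYTES, RAM_BANDWIDTH)
pcie_tps = max_tokens_per_second(MODEL_BYTES, PCIE3_X1_BANDWIDTH)

print(f"RAM-bound ceiling:  {ram_tps:.1f} tokens/s")
print(f"PCIe-bound ceiling: {pcie_tps:.2f} tokens/s")
print(f"Slowdown factor:    {ram_tps / pcie_tps:.0f}x")
```

With these assumed numbers the ceiling drops from about 12.8 tokens/s to about 0.25 tokens/s. Real throughput would be lower in both cases, since these are peak bandwidth figures and ignore compute and protocol overhead.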