You're missing another big advantage is cost. If you can do 1000tok/s on a $2/hr H100 vs 60tok/s on the same hardware, you can price it at 1/40th of the price for the same margin.
You can also slow down the hardware (say, dropping the clock and then voltages) to save huge amounts of power, which should be interesting for embedded applications.
out of curiosity, is anyone here using AI in embedded with experiences to share? I see NPUs and the like popping up more on credit card and buildroot SBCs I get, but with zero documentation or sample scripts for them.