by brucethemoose2 1021 days ago

Llama (and many other LLMs, I presume) is so memory-bandwidth-bound that model size is a decent indicator of inference rate.

The smaller the model, the less has to be read from RAM for every single token.
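To put rough numbers on that, here's a back-of-envelope sketch. It assumes every generated token reads all the weights from memory exactly once and nothing else matters, so it's an upper bound, not a benchmark. The 7B parameter count, the bytes-per-weight figures, and the 50 GB/s bandwidth are hypothetical values picked for illustration.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate if each token reads all weights once."""
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9
PARAMS = 7e9            # hypothetical 7B-parameter model
BANDWIDTH = 50 * GB     # hypothetical memory bandwidth, bytes per second

# Roughly 0.5 byte per weight at 4-bit, 2 bytes per weight at fp16.
q4 = max_tokens_per_second(PARAMS * 0.5, BANDWIDTH)
f16 = max_tokens_per_second(PARAMS * 2.0, BANDWIDTH)

print(f"4-bit: {q4:.1f} tok/s, fp16: {f16:.1f} tok/s")
```

Under these assumptions the 4-bit model is 4x smaller and the bound is 4x higher, which is the "size predicts speed" point.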

Batching mixes up this calculus a bit, since one read of the weights then serves every sequence in the batch.
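A rough way to see the batching effect is a simple roofline-style estimate: each decode step costs whichever is larger, the time to read the weights once or the time to do the math for the whole batch. This is only a sketch. The hardware figures are hypothetical, the ~2 FLOPs per parameter per token is the usual approximation, and KV-cache reads (which grow with batch size) are ignored.

```python
def batched_tokens_per_second(params: float, bytes_per_param: float, batch: int,
                              bandwidth_bytes_per_s: float, flops_per_s: float) -> float:
    """Aggregate decode throughput for one step over `batch` sequences."""
    # Weights are read once per step and shared by every sequence in the batch.
    weight_read_time = params * bytes_per_param / bandwidth_bytes_per_s
    # Compute scales with batch size: ~2 FLOPs per parameter per token.
    compute_time = batch * 2.0 * params / flops_per_s
    step_time = max(weight_read_time, compute_time)
    return batch / step_time


# Hypothetical accelerator: 1 TB/s memory bandwidth, 100 TFLOP/s, 7B params at fp16.
for batch in (1, 8, 100, 200):
    rate = batched_tokens_per_second(7e9, 2.0, batch, 1e12, 100e12)
    print(f"batch={batch:>3}: {rate:,.0f} tok/s total")
```

With these made-up numbers, throughput scales almost linearly with batch size while the step is bandwidth-bound, then flattens once compute becomes the bottleneck.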