by brucethemoose2
1021 days ago
Llama (and many other LLMs, I presume) are so memory-bandwidth bound that model size is a decent indicator of inference rate. The smaller the model, the less has to be read from RAM for every single token. Batching changes this calculus a bit, since one pass over the weights can then serve several sequences at once.
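To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every generated token requires reading all the weights from memory once, and it ignores KV-cache reads, compute limits, and any overhead, so treat the result as a ceiling rather than a prediction. The function name, the 50 GB/s bandwidth, and the model sizes are made-up illustrative values, not measurements.

```python
def max_tokens_per_second(model_bytes: float,
                          bandwidth_bytes_per_s: float,
                          batch_size: int = 1) -> float:
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumption: one decode step streams all weights from memory once
    and yields one token per sequence in the batch.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0 or batch_size < 1:
        raise ValueError("sizes and bandwidth must be positive, batch_size >= 1")
    steps_per_second = bandwidth_bytes_per_s / model_bytes
    return steps_per_second * batch_size


# Hypothetical numbers: ~50 GB/s of memory bandwidth,
# a 7B model at 4 bits per weight (~3.5 GB) vs. a 13B model (~6.5 GB).
BANDWIDTH = 50e9
small = max_tokens_per_second(3.5e9, BANDWIDTH)
large = max_tokens_per_second(6.5e9, BANDWIDTH)
batched = max_tokens_per_second(6.5e9, BANDWIDTH, batch_size=4)
print(f"7B: {small:.1f} tok/s, 13B: {large:.1f} tok/s, 13B x4 batch: {batched:.1f} tok/s")
```

Under these assumptions the ceiling scales inversely with model size, and batching multiplies total throughput without speeding up any single sequence. In practice large batches eventually become compute-bound, so the linear scaling does not hold forever.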