Hacker News new | ask | show | jobs
by Sohcahtoa82 248 days ago
Models are made of "parameters" which are really weights in a large neural network. For each token generated, each parameter needs to take its turn inside the CPU/GPU to be calculated.

So if you have a 7B parameter model with 16-bit quantization, that means you'll have 14 GB/s of data coming in. If you only have 153 GB/sec of memory bandwidth, that means you'll cap out ~11 tokens/sec, regardless of how my processing power you have.

You can of course quantize to 8-bit or even 4-bit, or use a smaller model, but doing so makes your model dumber. There's a trade-off between performance and capability.

1 comments

I think you mean GB/token
Err...yup. My bad. Can't edit it now.