by Sohcahtoa82
248 days ago
Models are made of "parameters", which are really just weights in a large neural network. For each token generated, every parameter has to take its turn being read from memory into the CPU/GPU to be computed on. So if you have a 7B-parameter model at 16-bit precision (2 bytes per parameter), that's 14 GB of data that has to come in for every token. If you only have 153 GB/sec of memory bandwidth, you'll cap out at ~11 tokens/sec (153 / 14 ≈ 10.9), regardless of how much processing power you have. You can of course quantize to 8-bit or even 4-bit, or use a smaller model, but doing so generally makes your model dumber. There's a trade-off between performance and capability.
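If you want to plug in your own numbers, here's a rough sketch of that back-of-the-envelope math in Python. The function name `max_tokens_per_sec` is just something made up for illustration, and it assumes the simplest possible case: every weight is read exactly once per token, with nothing else (KV cache, activations, batching) competing for bandwidth. Real numbers will come in below this ceiling.

```python
def max_tokens_per_sec(params_billions: float, bits_per_param: int,
                       bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens/sec when memory bandwidth is the bottleneck.

    Assumption: all weights are streamed from memory once per generated token.
    """
    bytes_per_param = bits_per_param / 8
    # Billions of params * bytes per param = model size in GB (1 GB = 1e9 bytes).
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb


# The 7B model and 153 GB/sec figure from above, at three precisions.
for bits in (16, 8, 4):
    cap = max_tokens_per_sec(7, bits, 153)
    print(f"{bits}-bit: ~{cap:.1f} tokens/sec")
```

Halving the bits per parameter doubles the ceiling, which is why quantizing is so tempting despite the quality hit.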