|
|
|
|
|
by SushiHippie
783 days ago
|
|
Sorry I don't know much about this topic. The only thing I know (from using it) that with quantization I can fit models like llama2 13b, in my 24GB of VRAM when I use q8 (16GB) instead of fp16 (26GB). This means I can get nearly the full quality of llama2 13b's output while still being able to use only my GPU, without the need to do very slow inference on only CPU+RAM. And the models are quantized before inference, so I'd only download 16GB for the llama2 13b q8 instead of the full 26GB, which means it's not done on the fly. |
|