by wkat4242
495 days ago
Even at q8_0? I thought the quality loss there wasn't bad, just as with the models themselves. But I'm very interested to hear more. And q8_0 already roughly halves the memory usage compared to fp16. One of the ollama devs called the quality impact negligible at q8_0: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio... But perhaps quantizing the KV cache does not scale as gracefully as quantizing the model itself?
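
For a rough sense of the "halves the memory" figure, here is a back-of-the-envelope sketch. The model dimensions are made up for illustration (not taken from any specific model), and `kv_cache_bytes` is a hypothetical helper, not an ollama or llama.cpp function. The one fixed fact is the ggml q8_0 layout: 32 int8 values plus one fp16 scale per block, so about 8.5 bits per element rather than exactly 8.

```python
# Rough KV-cache size estimate. All model dimensions below are assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    """Bytes needed for keys plus values across all layers at context n_ctx."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = one K and one V tensor
    return elems * bits_per_elem / 8

F16_BITS = 16.0
# ggml q8_0 block: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
Q8_0_BITS = 34 * 8 / 32  # 8.5 bits per element

# Hypothetical model shape, chosen only to make the arithmetic concrete.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

f16 = kv_cache_bytes(**dims, bits_per_elem=F16_BITS)
q8 = kv_cache_bytes(**dims, bits_per_elem=Q8_0_BITS)

print(f"f16:  {f16 / 2**20:.0f} MiB")
print(f"q8_0: {q8 / 2**20:.0f} MiB ({q8 / f16:.3f} of f16)")
```

With these assumed dimensions that comes to 1024 MiB at f16 versus 544 MiB at q8_0, so about 53% rather than exactly half. It says nothing about quality, only about size.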