by everythingctl
60 days ago
Maybe we can run more powerful models locally. I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn’t let you store more model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler’s data center.

So shrinking that by 6x (from fp16) would be a big win for larger models. True, TurboQuant can also be applied to model weights, but there it won't save any size over q4 compression; it will just have better accuracy.
Edits: Better context
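
To put rough numbers on the "more simultaneous inferences with the same memory" point, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads, head dimension 128), the 32k context, and the 8 GiB cache budget are all made-up values for illustration, and the helper names are hypothetical. The only figure taken from the thread is the 6x reduction from fp16. It ignores weight memory and any per-block quantization metadata.

```python
GIB = 2**30

def kv_cache_bytes_fp16(n_layers: int, n_kv_heads: int, head_dim: int, n_tokens: int) -> int:
    """Bytes of fp16 KV cache for one sequence: keys + values, 2 bytes per element."""
    keys_and_values = 2
    bytes_per_fp16 = 2
    return keys_and_values * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_fp16

def max_sequences(budget_bytes: int, fp16_bytes_per_seq: int, compression_ratio: int = 1) -> int:
    """How many sequences fit in the budget if the cache is shrunk by compression_ratio."""
    # Integer arithmetic only, so there is no floating-point rounding at the boundary.
    return (budget_bytes * compression_ratio) // fp16_bytes_per_seq

# Hypothetical model shape and context length (assumptions, not from the thread).
per_seq = kv_cache_bytes_fp16(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32 * 1024)
budget = 8 * GIB  # assumed memory left over for the KV cache

print(per_seq / GIB)                      # 4.0 GiB per 32k-token sequence at fp16
print(max_sequences(budget, per_seq))     # 2 sequences at fp16
print(max_sequences(budget, per_seq, 6))  # 12 sequences at 6x compression
```

Under these assumptions the same 8 GiB holds 12 long sequences instead of 2, or equivalently one sequence with a much longer context. The weights themselves take up exactly as much room as before.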