by LarsDu88 28 days ago
TurboQuant. They can fit more in less now

TurboQuant is a runtime optimization for a model's KV cache; it does not reduce the size of the model itself.
TurboQuant reduces the runtime memory needed for the model's KV cache.

This reduces both the memory bandwidth needed for inference (at the cost of slightly more compute) and the total VRAM used, so more VRAM is left over for weights on the same hardware.

You were replying to a comment estimating model params from hardware. I am saying the param count could be higher for the same hardware.
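
To put rough numbers on that, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: the model shape, the context length, the batch size, and the 4 bits per cached value are hypothetical, not TurboQuant's published figures, and quantization metadata overhead is ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits_per_value):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    Ignores scales, codebooks, and any other quantization metadata.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bits_per_value // 8


# Hypothetical model shape and workload (not taken from any real model card).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
SEQ_LEN, BATCH = 32_768, 4

GIB = 2**30

fp16_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, BATCH, 16)
quant_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, BATCH, 4)  # assumed 4 bits
freed = fp16_cache - quant_cache

# If the freed VRAM went to fp16 weights (2 bytes per parameter):
extra_params = freed // 2

print(f"fp16 KV cache:  {fp16_cache / GIB:.1f} GiB")
print(f"4-bit KV cache: {quant_cache / GIB:.1f} GiB")
print(f"freed VRAM:     {freed / GIB:.1f} GiB")
print(f"room for about {extra_params / 1e9:.1f}B more fp16 params")
```

With these made-up numbers the cache shrinks from 16 GiB to 4 GiB, which is the sense in which the param count could be higher on the same hardware: the weights are not compressed, there is just more room for them.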