TurboQuant reduces the runtime memory needed for the model's KV cache.

This reduces both the memory bandwidth needed for inference (at the cost of slightly more compute) and the total VRAM used, so more VRAM is left over for weights on the same hardware.

You were replying to a comment estimating model params from hardware. I am saying the param count could be higher for the same hardware.
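
To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the model shape (32 layers, 8 KV heads, head dim 128), the 32k context, and the 4-bit cache width are made up, not TurboQuant's published figures, and it ignores any per-block scale or metadata overhead a real quantizer adds.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits_per_value):
    """Approximate KV cache size in bytes (ignores quantization metadata)."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return n_values * bits_per_value // 8


def extra_params(freed_bytes, bits_per_weight):
    """How many additional weights fit in the freed memory."""
    return freed_bytes * 8 // bits_per_weight


# Hypothetical model shape and context length, chosen only for the arithmetic.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768, batch=1)

fp16_cache = kv_cache_bytes(**shape, bits_per_value=16)
q4_cache = kv_cache_bytes(**shape, bits_per_value=4)  # assumed 4-bit cache
freed = fp16_cache - q4_cache

GIB = 2**30
print(f"16-bit cache: {fp16_cache / GIB:.1f} GiB")
print(f"4-bit cache:  {q4_cache / GIB:.1f} GiB")
print(f"freed:        {freed / GIB:.1f} GiB")
print(f"extra 16-bit params that fit: {extra_params(freed, 16) / 1e9:.2f}B")
```

Under those assumptions the cache drops from 4 GiB to 1 GiB, and the freed 3 GiB is room for roughly 1.6B more 16-bit parameters. So an estimate that works backwards from VRAM with a full-precision cache would come out low.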