by suprjami
82 days ago
Some models suffer badly from KV quantisation. You can also take a speed hit using dissimilar K and V types. TurboQuant seems to be the next big thing in context memory usage. It uses polar coordinates to achieve a ~5x reduction in memory usage with minimal or no quality loss, and even a slight speedup in some cases.
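
For a rough sense of what ~5x means in practice, here's a back-of-envelope sketch of KV cache size. The model shape below is made up for illustration, and the 16/5 bits-per-value figure is just the ~5x claim applied to an f16 baseline, not TurboQuant's actual storage format.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes needed to hold both K and V for n_tokens of context."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = K plus V
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32768)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
# Assumption: a ~5x reduction relative to f16, i.e. about 3.2 bits per value.
reduced = kv_cache_bytes(**shape, bits_per_value=16 / 5)

print(f"f16 cache:   {fp16 / 2**30:.2f} GiB")
print(f"~5x smaller: {reduced / 2**30:.2f} GiB")
```

With these assumed numbers a 32k-token context drops from about 4 GiB to under 1 GiB, which is the difference between fitting long contexts on a consumer GPU or not.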