


	by stri8ed 1148 days ago
	How does the quantization happen? Are the weights preprocessed before loading the model?


ggerganov 1148 days ago

The weights are preprocessed into integer quants combined with scaling factors in various configurations (4-, 5-, and 8-bit, and more recently the more exotic 2-, 3-, and 6-bit quants). At runtime, we use efficient SIMD implementations to perform the matrix multiplication at the integer level, carefully optimizing for both compute and memory bandwidth. Similar strategies are applied when running GPU inference, using custom kernels for fast matrix-vector multiplications.
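
For readers who want to picture the "integer quants combined with scaling factors" idea, here is a minimal sketch in plain Python. It is not llama.cpp's actual format or kernel code: the block size of 32, the symmetric 8-bit scheme, and the names `quantize_block`, `quantize`, and `dot_quantized` are all assumptions made for illustration. The point is only that the inner loop works on integers and the floating-point scales are applied once per block.

```python
# Illustrative sketch only. BLOCK, the 8-bit symmetric scheme, and all
# function names are assumptions, not llama.cpp's real formats or kernels.

BLOCK = 32  # assumed number of values that share one scaling factor


def quantize_block(values, bits=8):
    """Symmetric quantization of one block: returns (scale, list of ints)."""
    qmax = (1 << (bits - 1)) - 1              # 127 for 8 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0  # avoid dividing by zero
    return scale, [round(v / scale) for v in values]


def quantize(vec, bits=8):
    """Split a vector into blocks and quantize each block independently."""
    return [quantize_block(vec[i:i + BLOCK], bits)
            for i in range(0, len(vec), BLOCK)]


def dot_quantized(qa, qb):
    """Dot product of two quantized vectors with matching block layout."""
    total = 0.0
    for (scale_a, ints_a), (scale_b, ints_b) in zip(qa, qb):
        # Integer-only accumulation; this is the part SIMD would speed up.
        int_sum = sum(x * y for x, y in zip(ints_a, ints_b))
        # The two scales are applied once per block, not once per element.
        total += scale_a * scale_b * int_sum
    return total
```

A matrix-vector product is then one such dot product per matrix row, with the weights quantized ahead of time and the input vector quantized at runtime.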


sebzim4500 1148 days ago

Yes, but to my knowledge it doesn't do any of the complicated optimization stuff that SOTA quantisation methods use. It is basically just doing a bunch of rounding.
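
As a rough picture of what "a bunch of rounding" could mean, the sketch below rounds each value in a block to the nearest 4-bit level and then reconstructs it. The function name `rtn_roundtrip`, the symmetric 4-bit range, and the sample values are hypothetical; this is plain round-to-nearest, not the scheme any particular library uses.

```python
# Hypothetical round-to-nearest example; names and values are illustrative.

def rtn_roundtrip(values, bits=4):
    """Quantize a block by rounding, then reconstruct the float values."""
    qmax = (1 << (bits - 1)) - 1              # 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    restored = [q * scale for q in quants]
    return scale, quants, restored


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.26]
scale, quants, restored = rtn_roundtrip(block)

# With plain rounding, no value moves by more than half a quantization step.
worst = max(abs(a - b) for a, b in zip(block, restored))
```

Nothing here looks at how the rounding error affects the model's outputs, which is the simplicity the comment is pointing at.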

There are advantages to simplicity, after all.


brucethemoose2 1148 days ago

It's not so simple anymore, see https://github.com/ggerganov/llama.cpp/pull/1684
