by devit 946 days ago
Based on the first figure in the paper, it seems that this scheme effectively turns 8 input values into a 4-bit number, thus giving an effective quantization of 0.5 bits per value.

Considering that current aggressive quantization for LLM transformers uses 4 bits, does such a 0.5-bit quantization produce an effective neural network?

Does the scheme stay competitive if it is changed to use 4-bit quantization instead of 0.5-bit?

1 comment

This is product quantization (PQ): a vector is chopped up into sub-vectors, and each sub-vector is quantized using vector quantization (VQ). It is not scalar quantization, which is what you're comparing it to here.
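
To make the "chop into sub-vectors, quantize each one" step concrete, here is a minimal sketch in plain Python. It assumes sub-vectors of 8 values and 16 centroids per codebook, to match the 8-values-to-4-bits reading of the figure. The names `pq_encode` and `nearest` are made up for illustration, and the codebooks are random placeholders rather than anything from the paper.

```python
import math
import random

def nearest(codebook, sub):
    """Return the index of the centroid closest (squared Euclidean) to sub."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], sub)))

def pq_encode(vec, codebooks, sub_dim):
    """Chop vec into sub-vectors of length sub_dim; replace each with a centroid index."""
    subs = [vec[i:i + sub_dim] for i in range(0, len(vec), sub_dim)]
    return [nearest(cb, s) for cb, s in zip(codebooks, subs)]

rng = random.Random(0)
sub_dim, n_centroids = 8, 16            # assumption: 16 centroids -> 4-bit index per 8 values
vec = [rng.gauss(0, 1) for _ in range(32)]

# Placeholder codebooks (one per sub-vector position); in practice these are learned.
codebooks = [[[rng.gauss(0, 1) for _ in range(sub_dim)] for _ in range(n_centroids)]
             for _ in range(len(vec) // sub_dim)]

codes = pq_encode(vec, codebooks, sub_dim)
bits_per_code = int(math.log2(n_centroids))
bits_per_value = len(codes) * bits_per_code / len(vec)
```

With these assumed sizes, 32 values become 4 indices of 4 bits each, which is where the 0.5 bits per value comes from. Each index refers to a whole 8-dimensional centroid, not to a rounded copy of any single input value.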

Also, most scalar quantization methods use uniform quantization (e.g., dividing the range between the scalar lower bound L and the scalar upper bound H into N equal regions, where N is usually 2^bit_width), whereas PQ (and VQ) is learned quantization: the codebook comes from running k-means on some training vector set. So they're not really directly comparable.
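
A rough side-by-side of the two ideas, again as a plain-Python sketch. `uniform_quantize` and `kmeans` are illustrative helpers written for this comment (not a library API), the k-means is a bare-bones Lloyd's iteration, and the training data is synthetic.

```python
import random

def uniform_quantize(x, lo, hi, bit_width):
    """Map scalar x in [lo, hi] to one of N = 2**bit_width equal-width regions."""
    n = 2 ** bit_width
    idx = int((x - lo) / (hi - lo) * n)
    return min(max(idx, 0), n - 1)      # clamp so that x == hi falls in the last region

def kmeans(points, k, iters=20, seed=0):
    """Learn k centroids from training vectors with plain Lloyd's iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], p)))
            buckets[j].append(p)
        for j, b in enumerate(buckets):
            if b:                        # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(b) for col in zip(*b)]
    return centroids

# Scalar, uniform: the region boundaries are fixed by L, H and the bit width alone.
q_mid = uniform_quantize(0.0, -1.0, 1.0, 4)
q_top = uniform_quantize(1.0, -1.0, 1.0, 4)

# Vector, learned: the codebook depends on the training data.
rng = random.Random(1)
train = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
codebook = kmeans(train, k=4)
```

The uniform quantizer needs no data at all, while the codebook changes whenever the training set does, which is the main reason bit widths from the two schemes don't line up one-to-one.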