by shahbazac 922 days ago
Can someone answer some CS 101 questions about this, please?

I know there are other methods related to matrix factorization, but I’m asking specifically about quantization.

Does quantization literally mean the weight matrix floats are being represented using fewer bits than the 64-bit standard?

Second, if fewer bits are being used, are CPUs able to do math directly on fewer bits? Aren’t CPU registers still 64-bit? Are these floats converted back to 64-bit for math, or is there some clever packing technique where a 64-bit word actually represents many numbers (sort of a hacky SIMD instruction)? Or do modern CPUs have the hardware to do math on fewer bits?

2 comments

This is for GPUs, not CPUs. GPUs do have lower-precision ALUs to do math on fewer bits. Though not 2 bits: I believe there’s support for 1-, 4-, and 8-bit computation in modern Nvidia cards.

But even without such support, the smaller model size is a benefit in itself: bigger models can fit in GPU memory, eliminating costly CPU/GPU data transfers.
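
To make the size argument concrete, here is a rough sketch of how 4-bit codes could be packed two per byte. The function names `pack_int4` and `unpack_int4` are made up for illustration, and real libraries may use a different layout (nibble order, padding, grouping).

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) into bytes, two codes per byte.

    Hypothetical layout: the first code of each pair goes in the low nibble.
    """
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in 0..15")
    # Pad with a zero code if the count is odd so every byte holds a pair.
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((hi << 4) | lo for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_int4(packed, count):
    """Recover the first `count` 4-bit codes from the packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)  # low nibble: first code of the pair
        out.append(b >> 4)    # high nibble: second code of the pair
    return out[:count]


codes = [1, 15, 0, 7, 9]
packed = pack_int4(codes)
restored = unpack_int4(packed, len(codes))
```

Five weights stored as 32-bit floats take 20 bytes; the same five weights as packed 4-bit codes take 3 bytes here, ignoring whatever metadata the quantization scheme needs to map codes back to real values.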

Yes, but no. The actual values represented by the quantized bits don't use a representation akin to IEEE floating point, but they are able to act like floating point values due to mathematical transformations during propagation. The floating point values that a quantized value corresponds to are chosen by some kind of precomputation that depends on the quantization method.
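
As a minimal sketch of what that precomputation can look like, assume the simplest scheme: one scale and one offset for the whole tensor, derived from its min and max. This is only one of many quantization methods, and the names `compute_params`, `quantize`, and `dequantize` are illustrative, not from any particular library.

```python
def compute_params(weights, bits=8):
    """Precompute a scale and offset mapping [min, max] onto 0..2^bits - 1."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant tensor, where hi == lo.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return scale, lo


def quantize(weights, scale, offset, bits=8):
    """Map each float to the nearest integer code, clamped to the valid range."""
    qmax = (1 << bits) - 1
    return [min(qmax, max(0, round((w - offset) / scale))) for w in weights]


def dequantize(codes, scale, offset):
    """Map integer codes back to the floats they stand for."""
    return [c * scale + offset for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, offset = compute_params(weights, bits=4)
codes = quantize(weights, scale, offset, bits=4)
approx = dequantize(codes, scale, offset)
```

The stored codes are plain small integers; only `scale` and `offset` are kept as floats. Each recovered value in `approx` is off from the original by at most half a quantization step (`scale / 2`), which is the precision traded away for the smaller representation.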