by jackcosgrove
107 days ago
I am a total neophyte when it comes to LLMs and only recently started poking around in their internals. The first thing that struck me was that float32 dimensions seemed very generous. Then I learned what quantization is from a blog post about binary quantization, which seemed too good to be true.

I asked Claude to design an analysis assessing the fidelity of 1-, 2-, 4-, and 8-bit quantization. Claude did a good job: it downloaded 10,000 embeddings from a public source and computed a similarity score and a correlation coefficient for each quantization level against the float32 source of truth. The 1- and 2-bit quantizations were about 90% similar, and 8-bit quantization was lossless at the precision Claude used to display the results. 4-bit was the interesting one: 99% similar (almost lossless) yet half the size of 8-bit. It seemed like the sweet spot.

This analysis took me all of an hour, so I thought, "That's cool, but is it real?" It's gratifying to see that 4-bit quantization is actually being used by professionals in this field.
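For anyone who wants to try the same kind of experiment, here is a minimal sketch in plain Python (standard library only). It is not the analysis Claude produced. It assumes synthetic Gaussian vectors as hypothetical stand-ins for the 10,000 public embeddings, and a simple per-vector min-max uniform quantizer, so the 1-bit case is a two-level threshold rather than the sign-based binary quantization from the blog post. It also guesses at the metrics: mean cosine similarity between each vector and its quantized copy as the "similarity score", and Pearson correlation of pairwise cosine similarities (before vs. after quantization) as the "correlation coefficient". The helper names `quantize`, `cosine`, `pearson`, and `evaluate` are hypothetical.

```python
import math
import random
from statistics import mean


def quantize(vec, bits):
    """Hypothetical helper: uniform min-max scalar quantization of one vector
    to 2**bits levels, returned already dequantized back to floats."""
    levels = 2 ** bits
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    # Snap each value to the nearest of the evenly spaced levels in [lo, hi].
    return [lo + round((x - lo) / step) * step for x in vec]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def evaluate(embeddings, bits, n_pair_vectors=50):
    """Hypothetical metrics: (mean cosine between each vector and its quantized
    copy, Pearson correlation of pairwise cosines before vs. after)."""
    quantized = [quantize(v, bits) for v in embeddings]
    self_sim = mean(cosine(v, q) for v, q in zip(embeddings, quantized))
    k = min(n_pair_vectors, len(embeddings))
    orig_sims, quant_sims = [], []
    for i in range(k):
        for j in range(i + 1, k):
            orig_sims.append(cosine(embeddings[i], embeddings[j]))
            quant_sims.append(cosine(quantized[i], quantized[j]))
    return self_sim, pearson(orig_sims, quant_sims)


# ASSUMPTION: synthetic Gaussian vectors stand in for real public embeddings.
random.seed(0)
DIM, N = 256, 200
embeddings = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N)]

results = {bits: evaluate(embeddings, bits) for bits in (1, 2, 4, 8)}
for bits, (sim, corr) in results.items():
    # Storage per vector, ignoring the min/max a real index would also keep.
    size = DIM * bits // 8
    print(f"{bits}-bit ({size} B vs {DIM * 4} B float32): "
          f"mean cosine {sim:.4f}, pairwise correlation {corr:.4f}")
```

Real embeddings are not Gaussian and are often normalized, so the exact numbers will differ from the ones reported above. The byte counts also leave out the per-vector min and max that a dequantizer needs, which matters more at very low bit widths.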
|
It doesn't seem terribly common yet, though. I think it is challenging to keep it stable.