
Models in this context are just a big list of numbers. The numbers will have a native "type", for example 32-bit floats. These are numbers like 0.7373663777 or -1.000003663. The 32-bit float type can represent something like 4.3 billion numbers (sort of).
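To make the "4.3 billion (sort of)" figure concrete, here is a small sketch using Python's standard `struct` module. The value is the one from the paragraph above, and the variable names are just for illustration. The "sort of" is because some of those bit patterns are reserved for things like NaN and infinity rather than ordinary numbers.

```python
import struct

# Pack a Python float into the 32-bit float format and read it back.
value = 0.7373663777
packed = struct.pack("<f", value)            # 4 bytes, little-endian float32
roundtrip = struct.unpack("<f", packed)[0]   # nearest value float32 can hold

bytes_per_float32 = len(packed)
bit_patterns = 2 ** (8 * bytes_per_float32)  # every possible 32-bit pattern

print(bytes_per_float32, bit_patterns)       # 4 4294967296
print(value, roundtrip)                      # close, but not necessarily identical
```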

It was discovered, though, that while models may need this level of precision when they are being created ("training"), they don't need nearly as much afterwards, when simply running them to get results ("inference").

So quantisation is the process of taking that big set of, say, 32-bit floats, and "mapping" them to a much smaller number type. Eg, an 8-bit integer ("INT8"). This is a number in the range 0-255 if unsigned (or -128 to +127 if signed).

So, to quantise a list of 32-bit floats, you could go through the list and analyse it. Maybe the values are all in the range -1.0 to +1.0. Maybe there are many around 0.99999, 0.998 and so on, so you decide to assign all of those the value "255" instead.

Repeat this until you've squashed that bunch of 32-bit values into 8 bits each. (Eg, with an even spread over -1.0 to +1.0, 0.750000 would come out as 223.)
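As a rough sketch of that squashing step, the snippet below spreads the observed range evenly over the 256 available codes. This is only one simple scheme (often called min-max or affine quantisation), not necessarily what any particular tool does. The `quantise` helper and the sample `weights` are made up for illustration.

```python
def quantise(values, levels=256):
    """Map floats linearly onto integer codes 0..levels-1 (hypothetical helper)."""
    lo, hi = min(values), max(values)
    # Width of one code step; fall back to 1.0 if every value is identical.
    scale = (hi - lo) / (levels - 1) or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

# Made-up sample values in the range -1.0 to +1.0.
weights = [-1.0, -0.5, 0.25, 0.75, 0.998, 1.0]
codes, scale, lo = quantise(weights)
print(codes)  # [0, 64, 159, 223, 255, 255]
```

Note that 0.998 and 1.0 end up sharing the code 255, which is exactly the kind of collapsing described above. The exact codes depend on the scheme chosen. You need to keep `scale` and `lo` around to turn the codes back into approximate floats later.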

This can make the model's memory footprint 4x smaller, and can also make it run faster. So where you needed 16GB to run it before, now you might only need 4GB.
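The 16GB versus 4GB arithmetic can be checked in a few lines. The parameter count here is a hypothetical figure picked so that the 32-bit version comes to 16GB. It only counts the weights themselves, not anything else a running model needs.

```python
from array import array

# Hypothetical parameter count, chosen so the float32 version is 16 GB.
params = 4_000_000_000

fp32_bytes = params * array("f").itemsize  # 32-bit float: 4 bytes per weight
int8_bytes = params * array("B").itemsize  # 8-bit integer: 1 byte per weight

print(fp32_bytes / 1e9, int8_bytes / 1e9)  # 16.0 4.0
```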

The cost is that the model won't be as accurate. But typically the quantised model still keeps something on the order of 90% of the original's accuracy, in exchange for memory savings of 4x. So it's deemed worth it.
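To see where the lost accuracy comes from, this sketch quantises a few made-up values, maps them back to floats, and measures how far they drifted. It uses the same simple even-spread scheme as an assumption, with hypothetical `quantise` and `dequantise` helpers. It shows per-number rounding error only; how much that hurts a real model's output quality has to be measured on the model itself.

```python
def quantise(values, levels=256):
    """Map floats linearly onto integer codes 0..levels-1 (hypothetical helper)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) or 1.0
    return [round((v - lo) / scale) for v in values], scale, lo

def dequantise(codes, scale, lo):
    """Turn integer codes back into approximate floats (hypothetical helper)."""
    return [lo + c * scale for c in codes]

weights = [-1.0, -0.5, 0.25, 0.75, 0.998, 1.0]
codes, scale, lo = quantise(weights)
restored = dequantise(codes, scale, lo)

# Each value can be off by at most half a code step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(restored)
print(max_error, scale / 2)
```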

It's through this process that folks can run models that would normally need a 5-figure GPU on their home machine, or even on the CPU, which may be able to process integers more easily and quickly than floating point.