| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fxtentacle 944 days ago

Nothing public, sorry. I do consulting on how to convert AIs from CUDA to C++ to save money. With a good quantization, you can sometimes replace a $19k A100 with a $0.5k EPYC. And especially for apps and/or WebGL interference, you want small models.

Anyway, if you quantize to -1, 0, or +1 and then use arithmetic coding, you come out at around 1.58 bits per parameter. And then by skewing the distribution with forced sparsity, you have something like 5% x -1, 90% x 0, 5% x +1 which comes out at about 0.6 bits per parameter after arithmetic coding.

I used that on "gpt_neox.layers.*.mlp.dense_h_to_4h.weight" (HuggingFace PyTorch implementation), for example. But for other layers you need more bits. For example, I could never get gpt_neox.embed_in.weight to less than 2% -2, 8% -1, 80% 0, 8% +1, 2% +2 which comes out at around 1.1 bits per parameter [1]. And then stuff like gpt_neox.layers.0.attention.query_key_value.weight will drive up your overall bits per parameter because those are very difficult to quantize or sparsify. That 1.5 was the average over the entire model and some layers compress even better while others compress worse.

[1] example calculation: https://www.wolframalpha.com/input?i=-%28log2%280.02%29*0.02...

1 comments

zith 944 days ago

Is it possible to get good performance in computation when encoding the data this way, or is there a lot of cycles lost to packing and unpacking these bits?

link

fxtentacle 944 days ago

It's actually much faster if you're limited by RAM bandwidth because instead of doing float x float mul, which requires 8 bytes of load and 4 bytes of store, you do an int8 x int8 mul with 2 bytes in and 1 byte out. And typically for a quantized LNN like this, you'd only do packing and unpacking before or after a matmul on the low-dimensional vectors so that you can directly use the quantized weights.

E.g. you quantize a 512-float activation to 512-int8, then matmul with 512x4096, Gelu, 4096x512 all in int8, then de-quantize to 512-float. That means no quantization overhead on those 4,194,304 parameters in your Dense layers.

link