by fxtentacle
944 days ago
Nothing public, sorry. I do consulting on converting AI models from CUDA to C++ to save money. With a good quantization, you can sometimes replace a $19k A100 with a $0.5k EPYC. And especially for apps and/or WebGL inference, you want small models.

Anyway, if you quantize to -1, 0, or +1 and then use arithmetic coding, you come out at around 1.58 bits per parameter (log2(3), since all three values are about equally likely). And then by skewing the distribution with forced sparsity, you get something like 5% -1, 90% 0, 5% +1, which comes out at about 0.6 bits per parameter after arithmetic coding. I used that on "gpt_neox.layers.*.mlp.dense_h_to_4h.weight" (HuggingFace PyTorch implementation), for example.

But for other layers you need more bits. For example, I could never get gpt_neox.embed_in.weight below 2% -2, 8% -1, 80% 0, 8% +1, 2% +2, which comes out at around 1.1 bits per parameter [1]. And then stuff like gpt_neox.layers.0.attention.query_key_value.weight will drive up your overall bits per parameter, because those layers are very difficult to quantize or sparsify. That 1.5 was the average over the entire model; some layers compress even better while others compress worse.

[1] example calculation: https://www.wolframalpha.com/input?i=-%28log2%280.02%29*0.02...
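The per-parameter figures above are the Shannon entropy of each weight distribution, which is the limit an ideal arithmetic coder approaches on independent symbols. A minimal Python sketch to reproduce them follows; the function and variable names are illustrative only and are not taken from any library or from the actual conversion pipeline.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol for a discrete distribution."""
    # Sanity check: the probabilities must sum to 1.
    assert abs(sum(probs) - 1.0) < 1e-9
    # Zero-probability symbols contribute nothing, so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Ternary weights {-1, 0, +1}, all equally likely.
ternary_uniform = entropy_bits([1 / 3, 1 / 3, 1 / 3])

# Ternary weights with forced sparsity: 5% -1, 90% 0, 5% +1.
ternary_sparse = entropy_bits([0.05, 0.90, 0.05])

# Five levels {-2, -1, 0, +1, +2}: 2%, 8%, 80%, 8%, 2%.
five_level = entropy_bits([0.02, 0.08, 0.80, 0.08, 0.02])

print(f"uniform ternary: {ternary_uniform:.3f} bits/param")
print(f"sparse ternary:  {ternary_sparse:.3f} bits/param")
print(f"five-level:      {five_level:.3f} bits/param")
```

This gives roughly 1.585, 0.569, and 1.066 bits per parameter, matching the rounded 1.58, 0.6, and 1.1 quoted above. A real coder adds a small overhead on top of the entropy, and the estimate assumes the weights are coded independently.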