by brucethemoose2
1086 days ago
There are already papers on it, and llama.cpp has a 2-bit quant. But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... for now. There was another scheme in a paper where the "sparse" majority of the model was heavily quantized, while the "dense" part was left in FP16, with good results.
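To make the mixed-precision idea concrete, here is a rough sketch in plain Python. This is not the scheme from that paper and not llama.cpp's format. It only illustrates "quantize most weights to 2 bits, leave a small part in FP16". The selection rule (keep the largest-magnitude weights), the function names, and the `keep_frac` parameter are all my own assumptions for illustration. It expects at least two weights.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value (struct's 'e' format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_mixed(weights, bits=2, keep_frac=0.01):
    """Quantize most weights to `bits` bits; keep a small fraction in FP16."""
    n = len(weights)
    n_keep = min(max(1, round(keep_frac * n)), n - 1)

    # Assumed selection rule: the largest-magnitude weights stay in FP16.
    order = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)
    kept = {i: to_fp16(weights[i]) for i in order[:n_keep]}

    # Uniform grid over the range of the remaining weights only.
    rest = [w for i, w in enumerate(weights) if i not in kept]
    lo, hi = min(rest), max(rest)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0

    # Kept positions get a placeholder code of 0; their value lives in `kept`.
    codes = [0 if i in kept else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return codes, scale, lo, kept


def dequantize_mixed(codes, scale, lo, kept):
    """Rebuild approximate weights from low-bit codes plus the FP16 entries."""
    return [kept.get(i, c * scale + lo) for i, c in enumerate(codes)]


weights = [0.1, -0.2, 0.3, -0.4, 50.0, 0.0, 0.25, -0.35]
codes, scale, lo, kept = quantize_mixed(weights, bits=2, keep_frac=0.125)
restored = dequantize_mixed(codes, scale, lo, kept)
```

The point of pulling the outlier out first is that the 2-bit grid then only has to cover the range of the remaining weights. If the 50.0 stayed in, the four levels would be stretched across that whole range and the small weights would collapse onto one or two values.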