by brucethemoose2 1086 days ago
There are already papers on it, and there is 2-bit quant in llama.cpp.

But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... For now.

There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.
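
To make the mixed-precision idea concrete, here is a rough sketch: most weights get rounded to a few bits, and a small set is left at full precision. This is not the paper's method. Picking the kept weights by magnitude, the `keep_fraction` value, and the name `quantize_mixed` are all my own assumptions for illustration.

```python
def quantize_mixed(weights, bits=2, keep_fraction=0.05):
    """Round most weights to `bits` bits; leave a small set untouched.

    Hypothetical helper: choosing the kept weights by largest magnitude
    is an assumption made for this sketch.
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    keep = set(by_magnitude[:n_keep])  # indices kept at full precision

    rest = [w for i, w in enumerate(weights) if i not in keep]
    if not rest:
        return list(weights)

    lo, hi = min(rest), max(rest)
    levels = (1 << bits) - 1             # e.g. 3 steps for 2 bits
    scale = (hi - lo) / levels or 1.0    # avoid a zero step size

    out = []
    for i, w in enumerate(weights):
        if i in keep:
            out.append(w)                # untouched
        else:
            q = round((w - lo) / scale)  # integer code in [0, levels]
            out.append(lo + q * scale)   # dequantized value
    return out
```

The point of keeping the largest weights out of the low-bit grid is that `lo` and `hi` then span a much narrower range, so the few available levels are spent on the bulk of the values.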