There are already papers on it, and there is 2-bit quant in llama.cpp.
But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... For now.
There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.
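To make that idea concrete, here is a rough NumPy sketch of that kind of split: the bulk of the weights is quantized to a few bits and a small remainder stays in FP16. This is not the paper's actual method. Picking the FP16 weights by largest magnitude, the 1% fraction, the 3-bit width, and the names `quantize_uniform` and `mixed_precision` are all my own assumptions for illustration.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1           # e.g. 3 for 3-bit: integers in [-3, 3]
    max_abs = float(np.max(np.abs(w))) if w.size else 0.0
    if max_abs == 0.0:
        return w.astype(np.float32)
    scale = max_abs / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return (q * scale).astype(np.float32)

def mixed_precision(w, bits=3, keep_frac=0.01):
    """Quantize most weights to `bits`, keep the largest few in FP16.

    Assumption: the FP16 weights are chosen by magnitude. A real scheme
    may use a different criterion (e.g. sensitivity).
    """
    mags = np.abs(w).ravel()
    k = max(1, int(round(keep_frac * mags.size)))
    threshold = np.partition(mags, -k)[-k]  # k-th largest magnitude
    keep = np.abs(w) >= threshold           # mask of weights left in FP16
    out = np.empty(w.shape, dtype=np.float32)
    out[~keep] = quantize_uniform(w[~keep], bits)
    out[keep] = w[keep].astype(np.float16)  # round-trip through FP16
    return out, keep

# Toy weights: mostly Gaussian, with a handful of large outliers.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[:8] *= 50.0

plain = quantize_uniform(w, bits=3)
mixed, keep = mixed_precision(w, bits=3, keep_frac=0.01)
print("plain 3-bit MSE:", float(np.mean((w - plain) ** 2)))
print("mixed MSE:      ", float(np.mean((w - mixed) ** 2)))
```

The point of the toy example is that a few outliers blow up the quantization scale for everything else, so pulling them out lets the remaining weights use a much finer grid.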
For a while I played with Brevitas and Xilinx's FINN, where you could quantize like crazy. I haven't looked at where they stand since transformers took over the AI world.