by Tepix
1154 days ago
GPTQ has been the missing piece: it allows quantizing the model weights from 16 to 4 bits with only a small loss in quality.
That in turn allows running even the large 65-billion-parameter version of the LLaMA model in ~33GB of RAM or VRAM. With VRAM, that requires two 24GB GPUs, which is no longer completely out of reach. The model running in the browser is a smaller version with 7 billion parameters, which is good enough for some things.
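
For anyone wanting to check the ~33GB figure, here is a rough back-of-the-envelope sketch. It assumes the nominal parameter counts (7 and 65 billion) and counts only the weights themselves; the helper name `weight_memory_gb` is made up for illustration, and real usage will be somewhat higher because of quantization metadata, activations and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Ignores quantization metadata (scales, zero points), activations
    and the KV cache, so treat the result as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, n_params in [("7B", 7e9), ("65B", 65e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: {fp16:.1f} GB at 16 bits, {int4:.1f} GB at 4 bits")
```

At 4 bits the 65B weights come to about 32.5GB, which is why a single 24GB card is not enough but two are.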