by Tepix 1154 days ago
GPTQ has been the missing piece: it allows quantizing the model weights from 16 bits to 4 bits with only a small loss in quality. That in turn allows running even the large 65-billion-parameter version of the LLaMA model in ~33GB of RAM or VRAM.

On GPUs, that means two 24GB cards, which is no longer completely out of reach.
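
The numbers above can be checked with a rough back-of-the-envelope calculation. This is only a sketch: the helper names are made up for illustration, a GB is taken as 10^9 bytes, and it counts the weights alone, ignoring quantization metadata (scales, zero points), activations and the KV cache, so real usage will be somewhat higher.

```python
import math

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    # Hypothetical helper: storage for the weights alone, in GB (10^9 bytes).
    return n_params * bits_per_weight / 8 / 1e9

def gpus_needed(memory_gb: float, vram_per_gpu_gb: float) -> int:
    # Hypothetical helper: smallest number of equal-sized GPUs that can hold
    # the weights, assuming they can be split evenly across the cards.
    return math.ceil(memory_gb / vram_per_gpu_gb)

llama_65b = 65e9  # parameter count of the large model
llama_7b = 7e9    # parameter count of the small model

fp16_gb = weight_memory_gb(llama_65b, 16)  # 16-bit weights
int4_gb = weight_memory_gb(llama_65b, 4)   # 4-bit weights after quantization

print(f"65B at 16 bits: {fp16_gb:.1f} GB")
print(f"65B at 4 bits:  {int4_gb:.1f} GB")
print(f"24GB GPUs needed for 65B at 4 bits: {gpus_needed(int4_gb, 24)}")
print(f"7B at 4 bits:   {weight_memory_gb(llama_7b, 4):.1f} GB")
```

The 4-bit figure for the 65B model comes out at 32.5 GB, in line with the ~33GB quoted above, and that is more than one 24GB card can hold but fits on two.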

The model running in the browser is a smaller version with 7 billion parameters, which is good enough for some things.