Hacker News
by tarruda, 930 days ago
Theoretically it could fit into a single 24GB GPU if quantized to 4 bits. ExLlama v2 has an even more efficient quantization algorithm and was able to fit 70B models into a 24GB GPU, but only with 2048 tokens of context.
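
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It counts only the weights and ignores the context (KV) cache, activations and framework overhead, so real usage will be higher. The function names are made up for illustration, and treating 24GB as 24e9 bytes is an assumption.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8


def max_bits_per_weight(n_params: float, budget_bytes: float) -> float:
    """Largest average bits per weight whose weights still fit in the budget."""
    return budget_bytes * 8 / n_params


GB = 1e9  # assumption: decimal gigabytes

if __name__ == "__main__":
    # Weights of a 70B-parameter model at 4 bits per weight, in GB.
    print(weight_bytes(70e9, 4) / GB)
    # Average bits per weight that would fill a 24 GB budget with weights only.
    print(max_bits_per_weight(70e9, 24 * GB))
```

Under these assumptions, 70B weights at 4 bits come to about 35 GB, so squeezing them into 24 GB means averaging under roughly 2.74 bits per weight, and whatever is left over has to hold the context. That would be consistent with the short 2048-token limit.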