HN Mirror


by 4b6442477b1280b 324 days ago

with quantization, a 20B model fits comfortably in 24 GB of VRAM
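for a rough sense of why that works, here is a back-of-the-envelope sketch. it only counts the weights and assumes a flat number of bits per weight; the function name and the 16/8/4-bit choices are mine, not from any particular runtime, and real usage adds KV cache, activations and framework overhead on top.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations and runtime overhead are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare unquantized 16-bit weights against 8-bit and 4-bit quantization.
    for bits in (16, 8, 4):
        size = weight_footprint_gb(20, bits)
        print(f"20B at {bits}-bit: ~{size:.0f} GB of weights")
```

under these assumptions the 16-bit weights alone are larger than 24 GB, while the 8-bit and 4-bit versions leave headroom for context.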

with quantization + CPU offloading, non-thinking models run kind of fine (at about 2-5 tokens per second) even with 8 GB of VRAM
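a minimal sketch of how the GPU/CPU split might be estimated when offloading. everything here is an assumption for illustration: layers are treated as equally sized, the 40-layer count and the 1.5 GB reserve for KV cache and buffers are made-up example values, and `layers_on_gpu` is a hypothetical helper, not an API from any inference library.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM; the rest stay on the CPU.

    Assumptions: all layers have the same size, and reserve_gb of VRAM
    is kept free for KV cache and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Whole layers only, and never more than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Hypothetical example: ~10 GB of quantized weights split into 40 layers.
    on_gpu = layers_on_gpu(model_gb=10.0, n_layers=40, vram_gb=8.0)
    print(f"{on_gpu} layers on GPU, {40 - on_gpu} offloaded to CPU")
```

the layers left on the CPU are the slow part, which fits with throughput dropping to a few tokens per second on a small card.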

sure, it would be great if we could have models in all sizes imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are great.