by int_19h 1142 days ago
An RTX 3090 or 4090 gets you 24 GB of VRAM, which is enough to run llama-30b (quantized to 4-bit with a groupsize of 1024 or higher) at speeds comparable to ChatGPT. You can also get two cards and split the model across them, although pumping data back and forth between them slows things down.

A brand new RTX A6000 (48 GB of VRAM) is probably the largest single card you can get that will run in a regular PC. It can be had for $4-5k and is sufficient for llama-65b.
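
For a rough sanity check on those numbers, here is a back-of-the-envelope sketch of the weight footprint at 4-bit. The parameter counts (about 32.5B and 65.2B) are approximate, and the per-group overhead (one 16-bit scale plus one 4-bit zero point per group) is my assumption about a typical grouped quantization layout, not a spec. It counts weights only; the KV cache and activations come on top of this, which is why you want headroom.

```python
def weight_gib(n_params, bits=4, group_size=None, scale_bits=16, zero_bits=4):
    """Approximate memory taken by quantized weights, in GiB.

    scale_bits and zero_bits are assumed per-group metadata sizes;
    real quantization formats may differ.
    """
    bits_per_weight = bits
    if group_size:
        # Per-group metadata is amortized over every weight in the group.
        bits_per_weight += (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 2**30


# Approximate parameter counts for the two models discussed above.
MODELS = {"llama-30b": 32.5e9, "llama-65b": 65.2e9}

for name, n_params in MODELS.items():
    gib = weight_gib(n_params, bits=4, group_size=1024)
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

Smaller group sizes push the overhead up (each group carries its own metadata), which is why the larger groupsize matters when you are trying to squeeze the 30b model into 24 GB.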

Beyond that, yeah, you're looking at dedicated multi-GPU server hardware.