by fullstackchris 1030 days ago
but how is the speed here? does it feel fast "enough"?

looking into running llama on prem / private cloud but i have no idea where to start in terms of sizing, do you have any details or posts on what the minimum / recommended hardware requirements are?

EDIT: just looked myself, not as encouraging as I'd like: "For good results, you should have at least 10GB VRAM at a minimum for the 7B model, though you can sometimes see success with 8GB VRAM. The 13B model can run on GPUs like the RTX 3090 and RTX 4090"

definitely borderline dealbreaking for solo hackers / small teams

1 comment

1x 3090 IMO is about the minimum worth spending time on. It can serve a 13B and a 7B model at once if you want, you can QLoRA-train a 13B with a long context length, and it's fast enough to iterate with for training.

I have 2x 3090 in my machine, and I can do inference at ~40 tokens/sec on a 13B Llama 2 model on one card. I can split the 70B model between the two cards and get ~12-15 tokens/sec. Sadly I can't train the 70B model with my 2x 3090 though, not quite enough VRAM.
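
For sizing, a back-of-the-envelope sketch may help. It only counts the memory needed to hold the weights (parameters times bits per parameter) and ignores KV cache, activations, and framework overhead, so treat the results as a lower bound. The function names and the 80% headroom factor are my own assumptions for illustration, not something from a library; the 24 GB figure is the VRAM of one 3090.

```python
# Rough weights-only VRAM estimate. Real usage is higher: this ignores
# KV cache, activations, and framework overhead.

def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """GB needed just to store the weights (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: int, vram_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the available VRAM.

    The 0.8 default is an assumed safety margin, not a measured value.
    """
    return weight_gb(params_billion, bits_per_param) <= vram_gb * headroom


if __name__ == "__main__":
    RTX_3090_GB = 24  # VRAM of a single RTX 3090

    for size in (7, 13, 70):
        for bits in (16, 8, 4):
            gb = weight_gb(size, bits)
            one = fits(size, bits, RTX_3090_GB)
            two = fits(size, bits, 2 * RTX_3090_GB)
            print(f"{size}B @ {bits}-bit: {gb:5.1f} GB weights | "
                  f"1x 3090: {one} | 2x 3090: {two}")
```

By this estimate a 13B and a 7B model at 4-bit need about 6.5 + 3.5 = 10 GB for weights, which is consistent with serving both on one 24 GB card, and a 70B model at 4-bit needs about 35 GB, which is consistent with splitting it across two cards. Training needs considerably more than the weights alone, which this sketch does not model.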