by wilonth 1085 days ago
7B params would take 14 GB of GPU RAM at FP16 precision (2 bytes per parameter). So it would be able to run on 16 GB GPUs with 2 GB to spare for other small things.

But in practice, no one is running inference at FP16. int8 is more like the bare minimum.
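
For reference, the arithmetic above can be sketched in a few lines of Python. This is a rough, weights-only estimate: the function names and the `overhead_gb` parameter are made up for illustration (the 2 GB default just mirrors the "2gb to spare" figure mentioned above), and real usage also depends on context length, KV cache, and the backend.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def fits(n_params: float, precision: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit in vram_gb.

    overhead_gb is a hypothetical allowance for everything that is not
    weights; it is not a measured number.
    """
    return weight_memory_gb(n_params, precision) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(7e9, precision)
        print(f"7B at {precision}: {gb:.1f} GB of weights, "
              f"fits on 16 GB: {fits(7e9, precision, 16)}, "
              f"fits on 8 GB: {fits(7e9, precision, 8)}")
```

Under these assumptions a 7B model at int8 needs about 7 GB for weights, so it only squeezes onto an 8 GB card if the overhead stays under roughly 1 GB.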
I have an 8GB card, and I am considering two more 8GB cards, or should I get a single 16GB? The 8GB card was donated, and we need some pipelining... I also have 10~15 2GB Quadro cards... Apparently useless.
I mean... It depends?

You are just trying to host a llama server?

Matching the VRAM across cards doesn't necessarily matter; get the most you can afford on a single card. Splitting beyond 2 cards doesn't work well at the moment.

Getting a non-Nvidia card is a problem for certain backends (like exLLaMA) but should be fine for llama.cpp in the near future.

AFAIK most backends are not pipelined: the load jumps sequentially from one GPU to the next, so only one GPU is working at any given moment.
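
To make the "not pipelined" point concrete, here is a toy model of a naive layer split. It assumes every layer takes the same time, ignores transfer cost between cards, and handles a single request at a time; `sequential_utilization` is a hypothetical helper, not part of any backend.

```python
def sequential_utilization(layers_per_gpu, time_per_layer=1.0):
    """Fraction of one forward pass during which each GPU is busy.

    Assumes the GPUs run strictly one after another (no pipelining),
    so each GPU is busy only while its own layers execute.
    """
    total_time = sum(layers_per_gpu) * time_per_layer
    return [n * time_per_layer / total_time for n in layers_per_gpu]


if __name__ == "__main__":
    # 32 layers split evenly over 2 cards, then over 4 cards.
    print(sequential_utilization([16, 16]))
    print(sequential_utilization([8, 8, 8, 8]))
```

In this model the busy fractions always sum to 1, so adding cards adds VRAM but not throughput: with N equal cards each one sits idle for (N-1)/N of the pass.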