by zaptrem 830 days ago
I think the rule of thumb is 1GB of VRAM per 1 billion params quantized to FP8.
1 comment

Just loading the model, without actually running it, takes 1GB per billion parameters at 8-bit quantization, in whatever memory it is loaded into and run from (VRAM, system RAM, or a combination, each with different performance characteristics). Models are often usefully run at 4-5 bit quantization, though, which saves half (or nearly so) of that.
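
To make the arithmetic concrete, here is a rough sketch of the weights-only estimate. The function name and the 7B example size are just illustrative assumptions, and it ignores any per-format overhead (quantization scales, metadata), so treat the numbers as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Hypothetical helper: assumes every parameter is stored at the same
    bit width and ignores quantization overhead such as scale factors.
    """
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Example: an assumed 7-billion-parameter model at several bit widths.
    for bits in (16, 8, 5, 4):
        print(f"7B params at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

At 8 bits this collapses to 1 byte per parameter, which is where the "1GB per billion params" figure comes from.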

To actually do inference you also need additional RAM that grows as some function of context size (not sure what function, and ISTR there are big-O differences between architectures in how it varies).
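
For one common case, a standard transformer that keeps a key/value cache during generation, the context-dependent part can be sketched as below. This is an assumption about the architecture, not something that holds for every model: the function name and the example layer/head counts are hypothetical, and architectures without a conventional KV cache will scale differently. In this case the cache grows linearly with context length.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB (10^9 bytes) for one sequence.

    Assumes a standard attention layer that caches one key vector and one
    value vector per KV head, per layer, per token. bytes_per_value=2
    corresponds to 16-bit cache entries.
    """
    # Leading factor of 2: one key tensor plus one value tensor.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    for ctx in (2048, 4096, 8192):
        print(f"context {ctx}: ~{kv_cache_gb(32, 32, 128, ctx):.2f} GB")
```

This is on top of the weights, and it leaves out activations and framework overhead, so real usage will be somewhat higher.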