| HN Mirror

Just to load the model without actually running it requires 1GB of whatever RAM it is loading and running in (could be VRAM, system RAM, or a combination, with different performance characteristics for each option) per billion parameters at 8-bit quantization. Though models often are usefully run at 4-5 bit quantization, which saves half (or nearly so) of that.

You also need additional RAM that increases as some function of context size (not sure what function, and ISTR there are big-O differences between architectures in how it varies) to actually do inference.