| HN Mirror

Perhaps I'm being charitable but I read OP's comment in the light of what you described with context length. Batching, context length, and attention implementation vary these numbers wildly. I can fit a 6bit quant Mistral Small (22b) on a 3090 with ~10-12k context, but Qwen2VL (7b, well 8.3b if you include vision encoder) also maxes out my 3090 VRAM with an 8bit quant and ~16k context.

I do think it would be good to include some info. on "what we expect to be common deployment scenarios, and here's some sample VRAM values".

Tangentially, whenever these models get released with fine-tuning scripts (FFT and Lora) I've yet to find a model that provides accurate information on the actual amount of VRAM required to train the model. Often times it's always 8x80GB for FFT, even for a 7B model, but you can tweak the batch sizes and DeepSpeed config. to drop that down to 4x80GB, then with some tricks (8bit Adam, Activation Checkpointing), drop it down to 2x80GB.