| HN Mirror



	by cmsj 606 days ago
	It really bugs me that every time I see posts about new models, there is never any indication of how much VRAM one needs to actually run them.


qeternity 606 days ago

That's because it's easily calculable and also somewhat impossible to say in any meaningful sense.

Most weights are released as fp16/bf16, so 2 bytes per weight. Just double the number of parameters (in billions) to get the number of gigabytes of VRAM: Llama 3.1 8B ~= 16GB of weights in fp16. At 4-bit quantization it's half a byte per weight, so halve the parameter count instead: Llama 3.1 8B ~= 4GB of weights.
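As a rough sketch of that rule of thumb (weights only, no context or runtime overhead), the arithmetic looks like this. The function name `weight_memory_gb` is made up for illustration, and "GB" here means 10^9 bytes.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Hypothetical helper: ignores KV cache, activations, and framework overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"8B model at {bits}-bit: ~{weight_memory_gb(8, bits):.0f} GB of weights")
```

Real quantized files tend to come out somewhat larger than this, since formats usually store scales and keep some tensors at higher precision.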

But this is just weights. The real issue is context and output length: how much data are you feeding in? This is where VRAM can explode, and it's entirely use-case dependent. So for a 128k context model, the range of VRAM usage is huge.

The reality is, if you're not able to quickly estimate the above, you're probably not running local models anyway.


bick_nyers 606 days ago

Perhaps I'm being charitable but I read OP's comment in the light of what you described with context length. Batching, context length, and attention implementation vary these numbers wildly. I can fit a 6bit quant Mistral Small (22b) on a 3090 with ~10-12k context, but Qwen2VL (7b, well 8.3b if you include vision encoder) also maxes out my 3090 VRAM with an 8bit quant and ~16k context.

I do think it would be good to include some info along the lines of "here's what we expect to be common deployment scenarios, and here are some sample VRAM values".

Tangentially, whenever these models get released with fine-tuning scripts (FFT and LoRA), I've yet to find one that provides accurate information on the actual amount of VRAM required to train the model. The stated requirement is often 8x80GB for FFT, even for a 7B model, but you can tweak the batch sizes and DeepSpeed config to drop that down to 4x80GB, then with some tricks (8-bit Adam, activation checkpointing) drop it down to 2x80GB.
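For a sense of why 8-bit Adam moves the needle, here is a back-of-the-envelope sketch of the per-parameter training state for full fine-tuning. The byte counts are assumptions, not numbers from any particular training script: a common mixed-precision accounting of 2 bytes for bf16 weights, 2 for bf16 gradients, 4 for an fp32 master copy, and 8 for the two fp32 Adam moments (about 2 if those moments are stored in 8 bits). Activations, which depend on batch size, sequence length, and checkpointing, are not included, and `training_state_gb` is a hypothetical name.

```python
def training_state_gb(
    params_billions: float,
    weight_bytes: float = 2,     # assumed: bf16 weights
    grad_bytes: float = 2,       # assumed: bf16 gradients
    master_bytes: float = 4,     # assumed: fp32 master copy of the weights
    optimizer_bytes: float = 8,  # assumed: two fp32 Adam moments; ~2 for 8-bit Adam
) -> float:
    """Approximate weights + gradients + optimizer state in GB (10^9 bytes).

    Excludes activations and framework overhead, so treat it as a lower bound.
    """
    per_param = weight_bytes + grad_bytes + master_bytes + optimizer_bytes
    return params_billions * per_param


if __name__ == "__main__":
    print(f"7B, fp32 Adam moments:  ~{training_state_gb(7):.0f} GB")
    print(f"7B, 8-bit Adam moments: ~{training_state_gb(7, optimizer_bytes=2):.0f} GB")
```

Under these assumptions a 7B model needs on the order of 100GB of state before any activations, which is at least consistent with multi-GPU recommendations even for small models.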


formalsystem 606 days ago

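You can estimate the impact of context length with a back-of-the-envelope calculation of KV cache size in bytes: 2 * layers * attention_heads * head_dim * bytes_per_element * batch_size * sequence_length. The leading 2 covers keys and values; for models that use grouped-query attention, count KV heads rather than query heads.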
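A minimal sketch of that formula in code. The function name and the example shape are assumptions for illustration: 32 layers, 8 KV heads, and head_dim 128 is what I understand Llama 3.1 8B to use, but check the model's config before relying on it.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # fp16/bf16 cache
) -> int:
    """KV cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element * batch_size * seq_len


if __name__ == "__main__":
    GIB = 1024**3
    # Assumed shape: 32 layers, 8 KV heads, head_dim 128.
    for ctx in (8_192, 32_768, 131_072):
        size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=ctx)
        print(f"context {ctx:>7}: {size / GIB:.1f} GiB of KV cache")
```

With that assumed shape, a single full 128k-token sequence costs about as much memory as the fp16 weights themselves, and the cost scales linearly with batch size.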

Some pretty charts here https://github.com/pytorch/ao/issues/539
