by extheat 680 days ago
A simple equation to approximate it is `memory_in_gb = parameters_in_billions * (bits/8)`

So at 32 bit full precision, 70 * (32 / 8) ~= 280GB

fp16, 70 * (16 / 8) ~= 140GB

8 bit, 70 * (8 / 8) ~= 70GB

4 bit, 70 * (4 / 8) ~= 35GB

However, llama.cpp quants are often mixed, with some of the weights at Q5, some at Q4, etc., so you usually want to estimate with the higher bit width.
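For anyone who wants to plug in their own numbers, here is a minimal sketch of the equation above in Python. The function name `weight_memory_gb` and the `mixed_bits` list are made up for illustration; it only estimates weight storage and treats a mixed quant as if every weight used the highest bit width present.

```python
def weight_memory_gb(parameters_in_billions: float, bits: float) -> float:
    """Approximate weight memory in GB: parameters_in_billions * (bits / 8)."""
    return parameters_in_billions * (bits / 8)


# The 70B examples from the comment.
for label, bits in [("fp32", 32), ("fp16", 16), ("8 bit", 8), ("4 bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(70, bits):.0f} GB")

# Mixed quant (assumed here: some weights Q4, some Q5).
# Budget with the higher bit width so the estimate errs on the safe side.
mixed_bits = [4, 5]
print(f"mixed Q4/Q5: ~{weight_memory_gb(70, max(mixed_bits)):.2f} GB")
```

Taking the max is deliberately pessimistic; the real file will be somewhat smaller, since only part of the weights use the higher width.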

1 comments

Well, that, and you also need a fair bit more space for the KV cache, which can be a bit unpredictable. Models without GQA, flash attention, or 4-bit cache support are really terrible in that regard, and it also depends on context length. Haven't found a good rule of thumb for that yet.
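Not a full rule of thumb, but a rough sketch of how the KV cache scales for a standard transformer: one K and one V tensor per layer, each sized `n_kv_heads * head_dim * context_len`. The function name and the shape values below (80 layers, head dim 128, 8 KV heads with GQA vs. 64 without) are illustrative assumptions for a 70B-class model, not numbers from this thread. It assumes a single sequence and ignores quantization block overhead and any other runtime buffers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_element: int = 16) -> int:
    """Approximate KV cache size in bytes for one sequence."""
    # 2 = one K tensor plus one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bits_per_element // 8


GIB = 2**30

# Assumed 70B-class shape at 4096 tokens of context (illustrative values).
with_gqa = kv_cache_bytes(80, 8, 128, 4096)          # 8 KV heads, fp16 cache
without_gqa = kv_cache_bytes(80, 64, 128, 4096)      # 64 KV heads, fp16 cache
gqa_4bit = kv_cache_bytes(80, 8, 128, 4096, bits_per_element=4)

print(f"GQA, fp16 cache:    {with_gqa / GIB:.2f} GiB")
print(f"no GQA, fp16 cache: {without_gqa / GIB:.2f} GiB")
print(f"GQA, 4-bit cache:   {gqa_4bit / GIB:.4f} GiB")
```

The estimate is linear in context length, KV head count, and cache bit width, which is why those three factors dominate how much extra space you need on top of the weights.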