by adrian_b 34 days ago
As a proxy for the total size of the parameters, you can just look at the download size of a model on Huggingface.co.

Because the weights of most models are split across many *.safetensors files of approximately the same size, you can estimate the total without adding up every file size: multiply the number of *.safetensors files by the approximate size of one file.
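
A minimal sketch of that shortcut in Python. The function name and the example numbers (15 shards of about 4.9 GB) are made up for illustration and do not describe any particular repository.

```python
GB = 10**9  # decimal gigabytes, the unit download pages usually show


def estimate_checkpoint_bytes(num_shards: int, approx_shard_bytes: float) -> float:
    """Rough total weight size: shard count times the size of a typical shard."""
    if num_shards < 0 or approx_shard_bytes < 0:
        raise ValueError("shard count and shard size must be non-negative")
    return num_shards * approx_shard_bytes


# Hypothetical repository: 15 shards of roughly 4.9 GB each.
total = estimate_checkpoint_bytes(15, 4.9 * GB)
print(f"approx. {total / GB:.1f} GB of weights")
```

The last shard is usually smaller than the others, so this tends to overestimate slightly, which is the safe direction when checking whether a model fits.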

For quantized models, estimating the size is simpler, because there is usually just one GGUF file. It also includes metadata, but most of the file is occupied by the parameters.

While there are models where all parameters are natively BF16, there are also models that mix several parameter sizes, e.g. a large number of small parameters, down to 4 bits each, together with a small number of larger parameters, up to FP32. Therefore, as you say, the parameter count is much less informative about memory requirements than the file sizes.
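
To illustrate why the parameter count alone misleads, here is a rough sketch that sums bits per format. The format table and both parameter breakdowns are invented for the example. Real block-quantized formats also store scales, so their effective bits per parameter are somewhat higher than the nominal value used here.

```python
from typing import Mapping

# Nominal bits per parameter; an assumption that ignores quantization scales.
BITS_PER_PARAM = {"fp32": 32, "bf16": 16, "fp16": 16, "int8": 8, "int4": 4}


def weights_bytes(param_counts: Mapping[str, int]) -> float:
    """Total weight size in bytes for a {format: parameter count} breakdown."""
    total_bits = sum(count * BITS_PER_PARAM[fmt] for fmt, count in param_counts.items())
    return total_bits / 8


# Two hypothetical models with the same parameter count (27.01 billion).
mixed = {"int4": 26_000_000_000, "bf16": 1_000_000_000, "fp32": 10_000_000}
uniform = {"bf16": 27_010_000_000}

for name, counts in (("mixed", mixed), ("uniform", uniform)):
    print(f"{name}: {weights_bytes(counts) / 1e9:.2f} GB")
```

Same parameter count, but the mixed model needs well under a third of the space of the all-BF16 one, which is what the file sizes would have shown directly.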

While the download size of the *.safetensors or GGUF files is not the same as the total memory requirement, it gives an approximate estimate, and it can be used to assess which of two models will need more memory. It becomes more complicated when you must use multiple kinds of memory, e.g. GPU memory and CPU memory, or even SSDs, because then you must know more about the structure of the model to determine how much of each kind of memory is needed.

1 comment

The KV cache size is a wild card, though. Different models use very different amounts of memory per token in the KV cache, so the VRAM required for, say, a 64k context can vary by almost an order of magnitude. The download size may tell you whether you have room for the model, but how much context fits in the leftover VRAM budget is harder to predict at a glance.
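
A back-of-the-envelope sketch of the per-token cost, assuming plain multi-head or grouped-query attention where every layer caches one key and one value vector per KV head. Models with sliding-window attention, compressed (latent) KV, or hybrid layers do not follow this formula. Both configurations and the 24 GiB / 15 GiB budget are hypothetical.

```python
GIB = 2**30


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_element: float = 2.0) -> float:
    """KV cache bytes per token; the factor 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """KV cache bytes for a full context of context_tokens tokens."""
    return kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element) * context_tokens


def max_context_tokens(vram_bytes: float, weight_bytes: float, per_token_bytes: float) -> int:
    """Tokens that fit in the VRAM left after the weights (ignores activations and overhead)."""
    return max(0, int((vram_bytes - weight_bytes) // per_token_bytes))


# Hypothetical configs: same depth and head size, 32 KV heads vs. 4 KV heads (GQA).
full_mha = dict(n_layers=32, n_kv_heads=32, head_dim=128)
gqa = dict(n_layers=32, n_kv_heads=4, head_dim=128)

for name, cfg in (("32 KV heads", full_mha), ("4 KV heads", gqa)):
    cache = kv_cache_bytes(context_tokens=64 * 1024, **cfg)
    fits = max_context_tokens(24 * GIB, 15 * GIB, kv_bytes_per_token(**cfg))
    print(f"{name}: {cache / GIB:.0f} GiB at 64k context, ~{fits} tokens fit in 9 GiB")
```

With a 16-bit cache these two hypothetical configs differ by 8x at the same context length. Passing bytes_per_element=1.0 roughly approximates an 8-bit cache, again ignoring the stored scales.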

It doesn't make things easier that some models, like Qwen3.6 27B, seem to be barely affected by a Q8-quantized KV cache while others degrade heavily.