As a proxy for the total size of the parameters, you can just look at the download size of a model on Huggingface.co. For most models the weights are provided in many *.safetensors files of approximately the same size, so you can estimate the total without adding up every file size: multiply the number of *.safetensors files by the approximate size of one file. For quantized models the estimate is simpler, because there is usually just one GGUF file. It also includes metadata, but most of the file is occupied by the parameters.

While there are models where all parameters are natively BF16, there are also models that mix several parameter formats, e.g. a large number of parameters in a small format, even down to 4 bits, together with a small number of parameters in a bigger format, up to FP32. Therefore, as you say, the number of parameters is much less informative about memory requirements than the file sizes.

The download size of the *.safetensors or GGUF files is not the same as the total memory requirement, but it gives an approximate estimate and can be used to assess which of two models will need more memory. It becomes more complicated when you must use multiple kinds of memory, e.g. GPU memory and CPU memory, or even SSDs, because then you must know more about the structure of the model to determine how much of each kind of memory is needed.
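To make the arithmetic concrete, here is a rough sketch of the two estimates described above: shard count times shard size, and parameter count times bits per parameter. The function names and all numbers are made up for illustration and do not describe any particular model; the result only approximates the size of the weights, not the total memory needed at runtime.

```python
def estimate_from_shards(num_shards: int, shard_size_gb: float) -> float:
    """Approximate total weight size (GB) as shard count times typical shard size."""
    return num_shards * shard_size_gb


def estimate_from_params(groups: list[tuple[float, float]]) -> float:
    """Approximate weight size (GB) from (parameter_count, bits_per_parameter) groups."""
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / 8 / 1e9  # bits -> bytes -> GB (decimal)


if __name__ == "__main__":
    # Hypothetical repository: 15 *.safetensors shards of about 4 GB each.
    print(estimate_from_shards(15, 4.0))

    # Hypothetical 27B-parameter model, everything stored as BF16 (16 bits).
    print(estimate_from_params([(27e9, 16)]))

    # Same parameter count, but 26B parameters at 4 bits and 1B kept at FP32.
    print(estimate_from_params([(26e9, 4), (1e9, 32)]))
```

The last two calls use the same parameter count but differ by roughly a factor of three in size, which is why the file size says more than the parameter count.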
It doesn't make things easier that some models, like Qwen3.6 27B, seem to be barely affected by a Q8-quantized KV cache while others degrade heavily.