Hacker News
by smokel 648 days ago
Does anyone know why the sizes of these models are typically expressed as a number of weights (i.e., 1.5B and 9B in this case), without mentioning the size of the weights in bytes?

For practical reasons, I often like to know how much GPU RAM is required to run these models locally. The actual number of weights seems to only express some kind of relative power, which I doubt is relevant to most users.

Edit: reformulated to sound like a genuine question instead of a complaint.

3 comments

Because you can quantize a model, e.g., from the original 16 bits down to 5 bits per weight, to fit your available memory.
The weight size depends on the precision you run the model at; you usually do not run a model at FP16, as it would be wasteful.
Since most LLMs are released as FP16, the parameter count alone is enough to estimate the GPU RAM needed for the weights: about 2 bytes per parameter.
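
To make the arithmetic in these replies concrete, here is a rough sketch of the estimate. The function name `weight_memory_gb` is made up for illustration, GB is taken as 10^9 bytes, and the result covers the weights only: KV cache, activations, and runtime overhead are not included, so treat the numbers as a lower bound rather than a real requirement.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumption: every weight is stored at the same bit width. Real
    quantization formats add some per-block metadata on top of this.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # The two model sizes from the question, at a few bit widths.
    for name, params in [("1.5B", 1.5e9), ("9B", 9e9)]:
        for bits in (16, 8, 5):
            gb = weight_memory_gb(params, bits)
            print(f"{name} at {bits} bits/weight: ~{gb:.1f} GB")
```

Under these assumptions a 9B model needs about 18 GB for its weights at 16 bits and about 5.6 GB at 5 bits, which is why the parameter count plus a chosen bit width is usually enough for a first guess.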