|
|
|
|
|
by danielhanchen
812 days ago
|
|
I guess it's mainly to reduce VRAM usage. Assuming we don't do this, then a 7b model with 1bit group size 8 will use 3GB or something of VRAM, whilst a 4bit group size of 64 will use 4GB approx. Assume we have a 100b model - with 4bit, VRAM is around 57GB or so. With 1bit, 43GB VRAM, but by moving the scalars and zero point to RAM, VRAM use is like 15GB or something, at the cost of like 28GB RAM usage. I guess maybe a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. Say you have a 40GB GPU, clearly move more of the scalars to GPU. But if you have a weak GPU, then you need to use more RAM. |
|
What if you store 40% of the data and 40% of the metadata in CPU memory. Is that the same for performance?
Is there a reason we want particular parts of the layer to stay on the GPU? Or is it just number of bytes.