|
|
|
|
|
by Dylan16807
806 days ago
|
|
My point is more that if it's that many bytes flowing around on demand, you're basically swapping layers in and out of the GPU as you use them (or x% of each layer). Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen. Is there actually something they gain from this particular method of splitting? |
|
Assume we have a 100b model - with 4bit, VRAM is around 57GB or so. With 1bit, 43GB VRAM, but by moving the scalars and zero point to RAM, VRAM use is like 15GB or something, at the cost of like 28GB RAM usage.
I guess maybe a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. Say you have a 40GB GPU, clearly move more of the scalars to GPU. But if you have a weak GPU, then you need to use more RAM.