by Dylan16807
807 days ago
> "This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?
* If no, then what's it for and when does it get used?

It's like in cuBLAS, where you compute alpha * A * B + beta * C: alpha and beta are both scalars that can live on the CPU and be moved to the GPU in nanoseconds.
I'm unsure, though, since I haven't tested it.
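
To make the cuBLAS analogy concrete, here's a minimal pure-Python sketch of what a GEMM call computes. The name `gemm` and the list-of-rows layout are made up for illustration; this is not cuBLAS itself, and it says nothing about how the quoted method actually handles its scale / zero_point metadata. The only point is that alpha and beta are two plain scalars, tiny next to the matrices. In real cuBLAS they are passed by pointer, and the pointer mode decides whether they are read from host or device memory.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for matrices stored as lists of rows."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [
            # One output element: scaled dot product plus scaled C entry.
            alpha * sum(A[i][k] * B[k][j] for k in range(inner)) + beta * C[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# The matrices stand in for the big GPU-resident data;
# alpha and beta are the small scalars that could sit on the host.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, A, B, 0.5, C)
```

Whether per-group quantization metadata behaves like these two scalars is exactly the open question above: there is far more of it than a single alpha and beta.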