by manmal
475 days ago
My understanding is that the GPU must still load its assigned layer from VRAM into registers and L2 cache for every token, because those aren't large enough to hold a significant portion of the layer's weights. So naively, for a 24GB layer, you'd need to move up to 24GB for every token.
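
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth bound and that all 24GB is streamed from VRAM exactly once per token. The function name and the 1000 GB/s bandwidth are my own illustrative assumptions, not figures for any particular card.

```python
def max_tokens_per_second(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if every token streams all weights from VRAM once."""
    if weights_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("weights_gb and bandwidth_gb_per_s must be positive")
    return bandwidth_gb_per_s / weights_gb


# 24 GB comes from the comment above; 1000 GB/s is an assumed, illustrative bandwidth.
bound = max_tokens_per_second(24.0, 1000.0)
print(f"{bound:.1f} tokens/s at most")  # prints "41.7 tokens/s at most"
```

This is only a ceiling for single-stream decoding. Batching several requests lets one pass over the weights serve multiple tokens, so aggregate throughput can be higher.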