by kaoD
82 days ago
> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about the context buffers, which I assume are not small for large models. Those have to be loaded into VRAM too, right? If I keep sending large contexts, will that hog the batches?