by zozbot234
306 days ago
It seems you'll have to offload more and more layers to system RAM as your maximum context size increases, since the KV cache grows with context and competes with the model weights for VRAM. llama.cpp has an option to set the number of layers that should be computed on the GPU (`-ngl` / `--n-gpu-layers`), whereas ollama tries to tune this automatically. Ideally, though, the system RAM/VRAM split would simply be readjusted dynamically as the context grows throughout the session. After all, some sessions may never reach the maximum size, so allowing for a higher maximum ends up leaving valuable VRAM unused during shorter sessions.
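To put rough numbers on that tradeoff, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache), the 150 MiB of quantized weights per layer, and the 8 GiB VRAM budget are made-up values, and the function names are hypothetical. It also assumes each GPU layer keeps its own slice of the KV cache in VRAM and ignores compute buffers and other overhead, so real tools will fit fewer layers than this.

```python
# Rough estimate of how the KV cache eats into the VRAM available for layers.
# All model and hardware numbers below are illustrative assumptions.

MIB = 1024 ** 2
GIB = 1024 ** 3

N_LAYERS = 32          # assumed transformer layer count
N_KV_HEADS = 8         # assumed number of KV heads (grouped-query attention)
HEAD_DIM = 128         # assumed dimension per head
BYTES_PER_ELEM = 2     # fp16 cache entries


def kv_bytes_per_layer(ctx_len: int) -> int:
    """Bytes of KV cache one layer needs for ctx_len tokens (keys + values)."""
    return 2 * N_KV_HEADS * HEAD_DIM * ctx_len * BYTES_PER_ELEM


def kv_cache_bytes(ctx_len: int) -> int:
    """Total KV cache size across all layers."""
    return N_LAYERS * kv_bytes_per_layer(ctx_len)


def layers_on_gpu(ctx_len: int,
                  vram_bytes: int = 8 * GIB,
                  weight_bytes_per_layer: int = 150 * MIB) -> int:
    """How many whole layers (weights + their KV slice) fit in the VRAM budget."""
    per_layer = weight_bytes_per_layer + kv_bytes_per_layer(ctx_len)
    return min(N_LAYERS, vram_bytes // per_layer)


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        print(f"ctx={ctx:>7}: KV cache {kv_cache_bytes(ctx) / GIB:.1f} GiB, "
              f"{layers_on_gpu(ctx)} of {N_LAYERS} layers fit on GPU")
```

With these made-up numbers, sizing for 128k context up front leaves room for far fewer GPU layers than sizing for 8k, which is exactly the VRAM a short session never gets to use.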
|
Not a major setback, because for long context I'd just use GPT or Claude, but it would be cool to have 128k context locally on my machine. When I get a new CPU I'll upgrade my RAM to 64 GB. My GPU is more than capable of what I need for a while, and a 5090 or 4090 is the next step up in VRAM, but I don't want to shell out 2k for a card.