by minimaxir 969 days ago
The presentation mentioned dynamic GPU caching: that seems like something transformer models would like.
2 comments

Could be, but I'd like to hear more information about what it actually entails.

My gut feeling is that it's kind of like Z compression, but using the large amount of privileged software (basically a whole RTOS) they run on the GPU to dynamically allocate pages, so that "VRAM" allocations (in scare quotes) don't require giant arenas.

If that's the case, I'm not sure that ML will benefit. Most ML models are pretty good about actually touching everything they allocate, in which case lazy allocations won't help you much and may actually hurt startup latency.
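
To make the "touching everything they allocate" point concrete, here is a rough CPU-side analogy in Python. It is only a sketch: it uses an anonymous `mmap` mapping, which most operating systems back lazily, and says nothing about how the GPU feature from the presentation actually works. The helper name `touch_all_pages` is made up for illustration.

```python
import mmap


def touch_all_pages(size_bytes: int) -> int:
    """Map anonymous memory, then write one byte per page.

    The mapping itself is usually cheap because the OS tends to hand out
    physical pages only on first write. A workload that writes to every
    page, as in this loop, ends up paying for all of them anyway.
    """
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, size_bytes)  # anonymous mapping, not backed by a file
    try:
        touched = 0
        for offset in range(0, size_bytes, page):
            buf[offset] = 1  # first write forces the page to be backed
            touched += 1
        return touched
    finally:
        buf.close()


if __name__ == "__main__":
    # 64 pages is an arbitrary size chosen for the example.
    print(touch_all_pages(64 * mmap.PAGESIZE))
```

If the workload touches every page like this, lazy allocation saves no memory; it only moves the cost from allocation time to first use.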

In addition to what mono said, llama.cpp allocates everything up front with "--mlock".

Llama.cpp (and MLC) have to read all the model weights from RAM for every token. Batching aside, there's no way around that.
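
A back-of-the-envelope sketch of what that implies: if every generated token has to stream all the weights once, memory bandwidth divided by model size gives a ceiling on tokens per second. The function name and the numbers below are hypothetical, and the `batch_size` parameter assumes the idealized case where one pass over the weights is shared perfectly across the batch.

```python
def max_tokens_per_second(
    model_bytes: float,
    bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Bandwidth-bound ceiling on decode throughput.

    Assumes each decoding step reads every weight exactly once and that
    compute and cache effects are ignored. With batching, one read of the
    weights is assumed to serve all sequences in the batch.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0 or batch_size < 1:
        raise ValueError("sizes and bandwidth must be positive, batch_size >= 1")
    return batch_size * bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    GB = 1e9
    # Hypothetical: a 4 GB quantized model on a machine with 50 GB/s of memory bandwidth.
    print(max_tokens_per_second(4 * GB, 50 * GB))  # 12.5
```

Real throughput will land below this ceiling, but it shows why lazy or dynamic allocation does not change the picture much when every weight is read on every token.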

Mlock is an optional parameter: github.com/ggerganov/llama.cpp/tree/master/examples/main#mlock