| HN Mirror



	by minimaxir 969 days ago
	The presentation mentioned dynamic GPU caching: that seems like something transformer models would like.

2 comments

monocasa 969 days ago

Could be, but I'd like to hear more information about what it actually entails.

My gut feeling is that it's kind of like Z compression, but using the high amount of privileged software (basically a whole RTOS) they run on the GPU to dynamically allocate pages so that scare quotes "vram" allocations don't require giant arenas.

If that's the case, I'm not sure that ML will benefit. Most ML models are pretty good about actually touching everything they allocate, in which case lazy allocations won't help you much and may actually hurt startup latency.
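
To make the point concrete, here is a toy sketch in Python of a lazily backed reservation. It is only a model of the idea, not how any GPU driver or firmware works: `LazyArena`, `touch`, and the 4096-byte page size are all made up for illustration. A workload that touches a few pages commits very little, while one that touches everything (as most ML models do) commits the full reservation, so laziness saves nothing.

```python
PAGE_SIZE = 4096  # bytes; an illustrative page size, not a claim about any GPU


class LazyArena:
    """Toy model of a reservation whose pages get backing only on first touch."""

    def __init__(self, reserved_bytes):
        # Ceiling division: number of pages needed to cover the reservation.
        self.reserved_pages = -(-reserved_bytes // PAGE_SIZE)
        self.committed = set()  # indices of pages that have real backing

    def touch(self, offset):
        """Simulate an access at a byte offset; commits the page if needed."""
        page = offset // PAGE_SIZE
        if page >= self.reserved_pages:
            raise IndexError("offset outside reservation")
        self.committed.add(page)

    def committed_bytes(self):
        return len(self.committed) * PAGE_SIZE


reserved = 64 * PAGE_SIZE

# Sparse workload: touches only the first 4 pages of the reservation.
sparse = LazyArena(reserved)
for offset in range(0, 4 * PAGE_SIZE, PAGE_SIZE):
    sparse.touch(offset)

# Dense workload: touches every page, like a model reading all of its weights.
dense = LazyArena(reserved)
for offset in range(0, reserved, PAGE_SIZE):
    dense.touch(offset)

print(sparse.committed_bytes(), "of", reserved)
print(dense.committed_bytes(), "of", reserved)
```

In the dense case the committed size equals the reserved size, and every first touch still pays the cost of backing a page, which is the startup-latency concern.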


brucethemoose2 969 days ago

In addition to what mono said, llama.cpp allocates everything up front with "--mlock"

Llama.cpp (and MLC) have to read all the model weights from RAM for every token. Batching aside, there's no way around that.
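
A back-of-the-envelope sketch of what that implies, in Python. If every token step streams all the weights once, memory bandwidth caps the token rate. The function name and the numbers below (a 4 GB model, 100 GB/s of bandwidth) are hypothetical, chosen only to show the arithmetic, and the bound ignores compute, caches, and KV-cache traffic.

```python
def max_tokens_per_second(model_bytes, bandwidth_bytes_per_s, batch_size=1):
    """Rough upper bound on token rate when each step reads all weights once.

    With batching, a single pass over the weights serves batch_size
    sequences, so the bound scales with the batch size.
    """
    return batch_size * bandwidth_bytes_per_s / model_bytes


GB = 10**9
model_bytes = 4 * GB   # hypothetical quantized model size
bandwidth = 100 * GB   # hypothetical memory bandwidth, bytes per second

print(max_tokens_per_second(model_bytes, bandwidth))                # 25.0
print(max_tokens_per_second(model_bytes, bandwidth, batch_size=8))  # 200.0
```

Under these assumptions, smarter allocation does not move the bound; only more bandwidth, a smaller model, or batching does.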


Art9681 968 days ago

Mlock is an optional parameter: github.com/ggerganov/llama.cpp/tree/master/examples/main#mlock
