| HN Mirror



	by brucethemoose2 969 days ago
	In addition to what mono said, llama.cpp allocates everything up front with "--mlock". Llama.cpp (and MLC) have to read all the model weights from RAM for every token. Batching aside, there's no way around that.
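
A rough way to see what "read all the model weights for every token" implies: if each generated token needs one full pass over the weights, memory bandwidth divided by model size gives a ceiling on single-stream tokens per second. The sketch below is only back-of-the-envelope arithmetic; the function name and the 4 GB / 50 GB/s figures are made-up illustrations, not measurements of llama.cpp or MLC.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once.

    Ignores compute time, cache effects, and batching, so real throughput
    would be at or below this number.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
model_size = 4 * GB   # e.g. a quantized model occupying about 4 GB of RAM
bandwidth = 50 * GB   # e.g. a machine with about 50 GB/s of memory bandwidth

print(f"{max_tokens_per_second(model_size, bandwidth):.1f} tokens/s upper bound")
# prints "12.5 tokens/s upper bound"
```

Batching changes the picture because one pass over the weights can serve several sequences at once, which is why it is the exception noted above.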

1 comment

Mlock is an optional parameter: github.com/ggerganov/llama.cpp/tree/master/examples/main#mlock