by brucethemoose2 969 days ago
In addition to what mono said, llama.cpp allocates everything up front with "--mlock".

Llama.cpp (and MLC) have to read all the model weights from RAM for every token. Batching aside, there's no way around that.
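
To put rough numbers on that: if every token needs one full pass over the weights, memory bandwidth divided by model size gives a ceiling on single-stream tokens per second. A back-of-the-envelope sketch in Python follows; the 4 GB model size and 50 GB/s bandwidth are made-up illustrative figures, not measurements, and the function name is my own.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on decode speed, assuming each token reads
    every weight from RAM exactly once and nothing else is the bottleneck."""
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("sizes and bandwidth must be positive")
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9

# Hypothetical figures for illustration only.
model_size = 4 * GB      # e.g. a quantized model file of about 4 GB
ram_bandwidth = 50 * GB  # e.g. about 50 GB/s of usable memory bandwidth

print(f"ceiling: {max_tokens_per_second(model_size, ram_bandwidth):.1f} tokens/s")
```

Real throughput will land below this ceiling, and batching changes the picture because one pass over the weights then serves several sequences at once.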

1 comments

Mlock is an optional parameter: github.com/ggerganov/llama.cpp/tree/master/examples/main#mlock