Llama.cpp (and MLC) have to read all of the model weights from RAM for every token they generate. Batching aside, there's no way around that.
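To put a rough number on that: if every decode step streams the full set of weights once, memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below is only a back-of-the-envelope estimate. The function name and the figures (a 7B-parameter model at about 4 bits per weight, 50 GB/s of bandwidth) are made-up assumptions for illustration, not measurements of llama.cpp or MLC, and it ignores compute, KV-cache reads, and cache effects.

```python
def max_tokens_per_second(model_bytes: float,
                          bandwidth_bytes_per_s: float,
                          batch_size: int = 1) -> float:
    """Bandwidth-bound ceiling on decode throughput.

    Assumes each decode step reads all weights exactly once, and that a
    batch of sequences shares that single pass over the weights.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("sizes and bandwidth must be positive")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # One full pass over the weights per step.
    steps_per_second = bandwidth_bytes_per_s / model_bytes
    # Each step yields one token per sequence in the batch.
    return steps_per_second * batch_size


# Hypothetical numbers, chosen only to illustrate the arithmetic.
params = 7e9                  # assumed parameter count
bytes_per_param = 0.5         # assumed ~4-bit quantization
model_bytes = params * bytes_per_param   # 3.5e9 bytes
bandwidth = 50e9              # assumed 50 GB/s of RAM bandwidth

single = max_tokens_per_second(model_bytes, bandwidth)
batched = max_tokens_per_second(model_bytes, bandwidth, batch_size=4)

print(f"single stream ceiling: {single:.1f} tok/s")
print(f"batch of 4 ceiling:    {batched:.1f} tok/s total")
```

With these assumed figures the single-stream ceiling works out to about 14 tokens per second. Batching raises total throughput because the same pass over the weights serves several sequences, but the per-sequence rate stays capped by the same bound.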