Llama.cpp is a pretty extreme cpu ram bus saturator, but I dunno how close it is (and its kind of irrelevant because why wouldn't you use a Metal backend).
If you want to run larger models, then CPU inference is your only choice.
Also, not many implementations can even use it.
Llama.cpp is a pretty extreme cpu ram bus saturator, but I dunno how close it is (and its kind of irrelevant because why wouldn't you use a Metal backend).