With two/three instances, you can probably fit a 70B model into RAM, and you don't need super low latency between models to be able to do inference split layerwise between machines.
Don't think so. I suspect it would require quite in-depth surgery of llamacpp to add in the ability to send activations over the internet and pipeline stuff to keep all the cores busy.