Hacker News new | ask | show | jobs
by londons_explore 945 days ago
With two/three instances, you can probably fit a 70B model into RAM, and you don't need super low latency between models to be able to do inference split layerwise between machines.
1 comments

Are there instructions for this distributed inference somewhere? Can I do this out of the box with llamacpp or similar?
Don't think so. I suspect it would require quite in-depth surgery of llamacpp to add in the ability to send activations over the internet and pipeline stuff to keep all the cores busy.