| HN Mirror

	by londons_explore 945 days ago
	With two or three instances, you can probably fit a 70B model into RAM, and you don't need super low latency between machines to run inference split layerwise across them.
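A rough back-of-the-envelope sketch of that claim, in case it helps. The figures are assumptions for illustration, not measurements: 70e9 parameters, 80 transformer layers and a hidden size of 8192 (the Llama 2 70B shape), fp16 weights at 2 bytes each. The helper names are made up, and KV cache, embeddings and runtime overhead are ignored.

```python
def split_layers(n_layers, n_machines):
    """Assign contiguous [start, end) layer ranges to machines as evenly as possible."""
    base, extra = divmod(n_layers, n_machines)
    ranges, start = [], 0
    for i in range(n_machines):
        size = base + (1 if i < extra else 0)  # first `extra` machines take one more layer
        ranges.append((start, start + size))
        start += size
    return ranges


def weight_gib(n_params, bytes_per_param):
    """Weight memory in GiB for a given parameter count and precision."""
    return n_params * bytes_per_param / 2**30


def activation_bytes_per_token(hidden_size, bytes_per_value=2):
    """Bytes crossing a machine boundary per token: one hidden-state vector."""
    return hidden_size * bytes_per_value


if __name__ == "__main__":
    # Assumed shape: 70e9 params, 80 layers, hidden size 8192, fp16.
    ranges = split_layers(80, 3)
    total = weight_gib(70e9, 2)
    for machine, (lo, hi) in enumerate(ranges):
        share = total * (hi - lo) / 80
        print(f"machine {machine}: layers {lo}..{hi - 1}, ~{share:.1f} GiB of weights")
    print(f"per-token transfer at each boundary: {activation_bytes_per_token(8192)} bytes")
```

Under those assumptions, only one hidden-state vector per token crosses each machine boundary, which is tiny next to the weights each machine holds. That is why the link between machines matters much less than the RAM on each one.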

1 comment

fbdab103 945 days ago

Are there instructions for this distributed inference somewhere? Can I do this out of the box with llama.cpp or similar?

londons_explore 945 days ago

Don't think so. I suspect it would require quite in-depth surgery on llama.cpp to add the ability to send activations over the internet and to pipeline the work so all the cores stay busy.
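To make the idea concrete, here is a toy sketch of layerwise pipelining. It is not llama.cpp code and uses none of its APIs. Each "machine" is a thread that owns a slice of the layers, and the `queue.Queue` between stages stands in for a network link carrying activations. The layers are plain functions on numbers, and every name here is hypothetical.

```python
import queue
import threading


def make_stage(layers):
    """Bundle a contiguous slice of layers into one callable stage."""
    def run(x):
        for layer in layers:
            x = layer(x)
        return x
    return run


def stage_worker(stage, inbox, outbox):
    """Receive activations, apply this stage's layers, forward the result."""
    while True:
        item = inbox.get()
        if item is None:      # sentinel: no more work, pass the shutdown downstream
            outbox.put(None)
            return
        outbox.put(stage(item))


def run_pipeline(stages, inputs):
    """Feed inputs through all stages; different stages work on different inputs at once."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage_worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(None)
    outputs = []
    while True:
        y = queues[-1].get()
        if y is None:
            break
        outputs.append(y)
    for t in threads:
        t.join()
    return outputs


if __name__ == "__main__":
    # Toy "layers"; a real model would apply transformer blocks to tensors.
    layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
    stages = [make_stage(layers[:2]), make_stage(layers[2:])]
    print(run_pipeline(stages, [1, 2, 3]))
```

One caveat: when generating a single sequence, each token depends on the previous one, so only one stage is busy at a time. A pipeline like this keeps every machine busy only when several independent requests or micro-batches are in flight.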