HN Mirror

	by mikeravkine 945 days ago
	Hetzner offers incredibly cheap ARM machines in the Falkenstein DC: for 25 EUR a month you can snag the top-of-the-line instance with 16 vCPUs and 32 GB of RAM. If your use case fits inside that 32 GB (no 70B models, sadly), the price-to-performance of a Q4_K_M GGUF model is really attractive on this setup.

1 comments

londons_explore 945 days ago

With two or three instances, you can probably fit a 70B model into RAM, and you don't need super low latency between machines to do inference split layerwise across them.
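
A rough back-of-the-envelope check of that claim, as a sketch rather than a measurement. The figures are assumptions, not taken from the thread: roughly 4.85 bits per weight for Q4_K_M, a hidden size of 8192 (as in Llama-2-70B-style models), fp16 activations, and only 80% of each machine's RAM usable for weights.

```python
import math

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def machines_needed(model_gb: float, ram_gb: float, usable_fraction: float = 0.8) -> int:
    """Machines required if only `usable_fraction` of RAM can hold weights (assumed)."""
    return math.ceil(model_gb / (ram_gb * usable_fraction))

def activation_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes handed from one machine to the next per token at a layer boundary."""
    return hidden_size * bytes_per_value

# Assumed figures: 70B parameters, ~4.85 bits/weight, 32 GB machines, hidden size 8192.
model_gb = quantized_size_gb(70e9, 4.85)
n_machines = machines_needed(model_gb, ram_gb=32)
handoff_bytes = activation_bytes_per_token(8192)

print(f"weights: ~{model_gb:.1f} GB, machines: {n_machines}, "
      f"per-token handoff: {handoff_bytes / 1024:.0f} KiB")
```

Under these assumptions the weights alone need two 32 GB machines, and only one hidden-state vector per token crosses each machine boundary, which is why link bandwidth matters little here. The KV cache and runtime overhead are not modelled.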

fbdab103 945 days ago

Are there instructions for this kind of distributed inference somewhere? Can I do it out of the box with llama.cpp or similar?

londons_explore 945 days ago

Don't think so. I suspect it would require quite in-depth surgery on llama.cpp to add the ability to send activations over the internet and to pipeline the work so that all the cores stay busy.
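
To make the pipelining idea concrete, here is a toy sketch of the scheduling only. It is not llama.cpp code; `split_layers` and `pipeline_schedule` are hypothetical helpers, and the 80-layer count is an assumption for a 70B-class model. Each machine owns a contiguous block of layers, and several independent requests are staggered so that every machine has work at each step.

```python
def split_layers(n_layers: int, n_stages: int) -> list[range]:
    """Assign contiguous, near-equal blocks of layers to each machine (stage)."""
    base, extra = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        count = base + (1 if stage < extra else 0)  # spread the remainder over early stages
        ranges.append(range(start, start + count))
        start += count
    return ranges

def pipeline_schedule(n_stages: int, n_requests: int) -> list[list[tuple[int, int]]]:
    """For each time step, list the (stage, request) pairs that run concurrently."""
    steps = []
    for t in range(n_stages + n_requests - 1):
        active = [(s, t - s) for s in range(n_stages) if 0 <= t - s < n_requests]
        steps.append(active)
    return steps

# Assumed setup: 80 layers over 3 machines, 4 independent requests in flight.
stages = split_layers(80, 3)
schedule = pipeline_schedule(n_stages=3, n_requests=4)

for t, active in enumerate(schedule):
    print(t, active)
```

With 3 stages and 4 requests the schedule takes 6 steps instead of 12 run back to back. The overlap only comes from independent requests: a single sequence generating one token at a time has to pass through the stages in order, so it cannot keep all machines busy on its own.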