by mikeravkine 945 days ago
Hetzner offers incredibly cheap ARM machines in the Falkenstein DC. For 25 EUR a month you can snag the top-of-the-line instance with 16 vCPUs and 32 GB of RAM.

If your use case fits inside that 32 GB (no 70B models, sadly), the price-to-performance of a GGUF Q4_K_M model is really attractive on this setup.
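
A rough back-of-the-envelope check of the 32 GB limit, as a sketch only. The figure of about 4.8 bits per weight for Q4_K_M is an assumed average, not a measured value, and the estimate ignores the KV cache, context buffers, and the OS, so real headroom is smaller than shown.

```python
def gguf_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Q4_K_M averages roughly 4.8 bits per weight.
BPW_Q4_K_M = 4.8
RAM_GB = 32  # RAM of the instance described above

for name, n_params in [("7B", 7e9), ("13B", 13e9), ("34B", 34e9), ("70B", 70e9)]:
    size = gguf_weight_gb(n_params, BPW_Q4_K_M)
    verdict = "fits" if size < RAM_GB else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights, {verdict} in {RAM_GB} GB")
```

Under that assumption a 70B model needs around 42 GB for the weights alone, which is why it does not fit on a single 32 GB instance.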


With two or three instances, you can probably fit a 70B model into RAM, and you don't need super low latency between machines to do inference split layerwise across them.
Are there instructions for this distributed inference somewhere? Can I do this out of the box with llama.cpp or similar?
Don't think so. I suspect it would require quite in-depth surgery on llama.cpp to add the ability to send activations over the internet and to pipeline the work so that all the cores stay busy.
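
To make the layerwise split concrete, here is a minimal planning sketch. It is not llama.cpp code and does no inference. The helper names are hypothetical, and the 80 layers and hidden size of 8192 are assumed values for a Llama-2-70B-style model. It only shows how layers could be divided between machines and how small the activation payload per token is at each machine boundary.

```python
def split_layers(n_layers: int, n_machines: int) -> list[range]:
    """Assign contiguous blocks of layers to machines as evenly as possible."""
    base, extra = divmod(n_layers, n_machines)
    stages = []
    start = 0
    for i in range(n_machines):
        count = base + (1 if i < extra else 0)  # first `extra` machines get one more
        stages.append(range(start, start + count))
        start += count
    return stages


def activation_bytes(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes sent per token across one machine boundary (fp16 assumed)."""
    return hidden_size * bytes_per_value


# Assumed shape of a Llama-2-70B-style model.
N_LAYERS = 80
HIDDEN_SIZE = 8192

stages = split_layers(N_LAYERS, 3)
for machine, layers in enumerate(stages):
    print(f"machine {machine}: layers {layers.start}..{layers.stop - 1}")
print(f"per-token payload per hop: {activation_bytes(HIDDEN_SIZE)} bytes")
```

With these assumptions each hop carries about 16 KB per generated token, so bandwidth is not the constraint. The hard part is the one described above: keeping every machine busy, since a single token passes through the stages one after another.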