by TYMorningCoffee 404 days ago
Can the inference piece be partitioned over multiple hosts?

Edit: I mean partitioned, or otherwise structured algorithmically, in a way that overcomes the network bottleneck.

2 comments

> prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices— laptops, desktops, phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!

https://github.com/Lizonghang/prima.cpp

Pretty sure llama.cpp can already do that
I forgot to clarify: I meant in a way that deals with the network bottleneck.
Just my two cents from experience: any sufficiently advanced LLM training or inference pipeline eventually figures out that the real bottleneck is the network!
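
For a rough sense of scale, here is a back-of-the-envelope sketch, assuming the model is split layer-wise across devices and each hop forwards one hidden-state vector per generated token. The function name, the hop count, the link bandwidth, and the latency are all hypothetical illustration values, not measurements from prima.cpp or llama.cpp; the hidden size of 8192 is assumed as typical for a 70B-class model.

```python
def per_token_network_ms(hidden_size, bytes_per_value, hops, bandwidth_mbps, latency_ms):
    """Estimate network time per generated token for a layer-wise split.

    Assumes each hop ships one hidden-state vector (hidden_size values)
    and pays one fixed link latency. All inputs are hypothetical.
    """
    payload_bits = hidden_size * bytes_per_value * 8
    # Time to push the payload through the link, in milliseconds
    transfer_ms = payload_bits / (bandwidth_mbps * 1_000_000) * 1000
    return hops * (transfer_ms + latency_ms)


# Illustrative numbers only: fp16 activations, 4 hops per token,
# a 100 Mbit/s home link with 2 ms latency per hop.
estimate = per_token_network_ms(
    hidden_size=8192,
    bytes_per_value=2,
    hops=4,
    bandwidth_mbps=100,
    latency_ms=2.0,
)
print(f"{estimate:.2f} ms of network time per token")
```

In this simplified model the per-hop latency is paid on every token regardless of payload size, so adding bandwidth alone only shrinks part of the cost; reducing the number of hops or the link latency attacks the rest.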