Hacker News


	by TYMorningCoffee 451 days ago
	Can the inference piece be partitioned over multiple hosts? Edit: I mean partitioned (or otherwise restructured algorithmically) in a way that overcomes the network bottleneck.


Maxious 451 days ago

> prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices— laptops, desktops, phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!

https://github.com/Lizonghang/prima.cpp
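A common way to split inference across hosts is a pipeline-style layer partition. Each host holds a contiguous block of transformer layers and forwards only the hidden state to the next host. The sketch below assigns layers in proportion to each host's free memory. The `Host` type, the `partition_layers` helper, the memory-proportional rule, and the example host sizes are all hypothetical illustrations, not prima.cpp's actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_mem_gb: float  # hypothetical: memory this host can spend on weights


def partition_layers(num_layers: int, hosts: list[Host]) -> dict[str, range]:
    """Assign contiguous blocks of layers to hosts, roughly in proportion
    to each host's free memory (a simple pipeline-style split)."""
    total_mem = sum(h.free_mem_gb for h in hosts)
    if total_mem <= 0:
        raise ValueError("hosts report no free memory")
    assignment: dict[str, range] = {}
    start = 0
    for i, host in enumerate(hosts):
        if i == len(hosts) - 1:
            # The last host takes whatever layers remain after rounding.
            count = num_layers - start
        else:
            count = round(num_layers * host.free_mem_gb / total_mem)
            # Never hand out more layers than are left.
            count = min(count, num_layers - start)
        assignment[host.name] = range(start, start + count)
        start += count
    return assignment


# Hypothetical home cluster serving an 80-layer model.
hosts = [Host("laptop", 16), Host("desktop", 32), Host("phone", 8)]
parts = partition_layers(80, hosts)
for name, layers in parts.items():
    print(f"{name}: layers {layers.start}..{layers.stop - 1}")
```

With these numbers the laptop gets layers 0..22, the desktop 23..68, and the phone 69..79. A real scheduler would also weigh compute speed and link latency, not only memory.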


happyPersonR 451 days ago

Pretty sure llama.cpp can already do that


TYMorningCoffee 451 days ago

I forgot to clarify: I mean in a way that deals with the network bottleneck.


moralestapia 450 days ago

Just my two cents from experience: any sufficiently advanced LLM training or inference pipeline eventually figures out that the real bottleneck is the network!
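A back-of-the-envelope estimate shows why the way the model is split matters so much on a slow network. This is a rough sketch under stated assumptions, not a measurement. It assumes a Llama-3-70B-like shape (hidden size 8192, 80 layers), fp16 activations, 4 hosts, and 1 ms of network latency per hop. It compares a pipeline (layer) split, where one hidden-state vector crosses each stage boundary per generated token, with a Megatron-style tensor-parallel split, which needs two all-reduces per layer per token. The helper names are hypothetical.

```python
def pipeline_bytes_per_token(hidden_dim: int, num_hosts: int,
                             bytes_per_elem: int = 2) -> int:
    # One hidden-state vector crosses each of the (num_hosts - 1) stage
    # boundaries; the hop that returns the sampled token is ignored here.
    return (num_hosts - 1) * hidden_dim * bytes_per_elem


def tensor_parallel_syncs_per_token(num_layers: int,
                                    allreduces_per_layer: int = 2) -> int:
    # Megatron-style split: attention and MLP blocks each end in an all-reduce.
    return num_layers * allreduces_per_layer


def network_latency_ms(hops: int, latency_ms: float) -> float:
    # Lower bound: every hop or sync waits at least one network latency.
    return hops * latency_ms


# Assumed model and cluster shape (hypothetical numbers).
HIDDEN, LAYERS, HOSTS, LAT_MS = 8192, 80, 4, 1.0

pp_bytes = pipeline_bytes_per_token(HIDDEN, HOSTS)
pp_latency = network_latency_ms(HOSTS - 1, LAT_MS)
tp_syncs = tensor_parallel_syncs_per_token(LAYERS)
tp_latency = network_latency_ms(tp_syncs, LAT_MS)

print(f"pipeline: {pp_bytes} bytes/token, >= {pp_latency} ms network wait")
print(f"tensor parallel: {tp_syncs} syncs/token, >= {tp_latency} ms network wait")
```

Under these assumptions the pipeline split moves only about 48 KiB per token with three network waits, while the tensor-parallel split needs 160 synchronizations per token. This is why layer-wise splits tend to be preferred over home networks, even though they leave hosts idle while waiting for their turn.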
