The real challenge for at-scale inference is that model compute takes too long to keep normal API connections open, so you need a message-passing system in place. That system also needs to be able to deliver large files for multi-modal models if it's not going to be obsolete in a year or two.
I built a proof of concept using email, of all things, but could never get anyone to fund the real deal, which could run at larger than web scale.
You could use messenger pigeons if you felt like it.
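To make the message-passing idea concrete, here is a minimal sketch in which the caller submits a job, gets an ID back immediately, and collects the result later instead of holding a connection open. Everything in it is an assumption for illustration: `Mailbox`, `slow_model`, and `run_demo` are hypothetical names, and the in-memory queue merely stands in for whatever transport you pick (email, a broker, pigeons).

```python
import queue
import threading
import uuid


class Mailbox:
    """Hypothetical in-memory transport; a stand-in for email, a broker, etc."""

    def __init__(self):
        self.requests = queue.Queue()
        self.results = {}
        self._done = threading.Condition(threading.Lock())

    def submit(self, payload):
        # Returns right away with a job ID; no connection stays open.
        job_id = str(uuid.uuid4())
        self.requests.put((job_id, payload))
        return job_id

    def post_result(self, job_id, result):
        with self._done:
            self.results[job_id] = result
            self._done.notify_all()

    def fetch(self, job_id, timeout=None):
        # Blocks until the result arrives or the timeout expires.
        with self._done:
            ready = self._done.wait_for(lambda: job_id in self.results, timeout)
            return self.results.pop(job_id) if ready else None


def slow_model(prompt):
    # Placeholder for long-running inference (assumption, not a real model).
    return prompt.upper()


def worker(mailbox):
    while True:
        job_id, payload = mailbox.requests.get()
        if payload is None:  # sentinel: shut the worker down
            break
        mailbox.post_result(job_id, slow_model(payload))


def run_demo():
    mailbox = Mailbox()
    thread = threading.Thread(target=worker, args=(mailbox,), daemon=True)
    thread.start()
    job_id = mailbox.submit("hello swarm")
    result = mailbox.fetch(job_id, timeout=5)
    mailbox.requests.put(("stop", None))
    thread.join(timeout=5)
    return result
```

Swapping the transport only means reimplementing `submit`, `post_result`, and `fetch`; the worker and the caller do not need to change.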
People really don't understand how much better LLM swarms get with more agents. I never hit a point of diminishing returns on text quality over two days of running a swarm of llama2 70Bs on an 8x4090 cluster during the stress test.
You would need something similar to, but better than, WhatsApp to handle the firehose of data that needs to cascade between agents when you start running this at scale.
>People really don't understand how much better LLM swarms get with more agents. I never hit a point of diminishing returns on text quality
Could you elaborate, please?
One use for swarms is to replace a single agent with one long prompt by multiple agents/prompts, increasing performance by splitting one big task into many. It is very time-consuming though, as it requires experimenting to determine how best to divide one task into subtasks, including writing code to parse and sanitize each task output and plug it back into the rest of the agent graph.
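As a rough sketch of the glue code described above: each subtask goes to its own agent, the raw output is parsed and sanitized, and the cleaned pieces are fed into a final merge step. `fake_agent` is a hypothetical stand-in for an LLM call, and `parse_output` and `run_graph` are made-up names for illustration.

```python
import json


def fake_agent(prompt):
    """Hypothetical stand-in for an LLM call; wraps JSON in chatty filler."""
    topic = prompt.split(":", 1)[1].strip()
    return (
        "Sure! Here is the result:\n"
        f'{{"summary": "  notes on {topic} "}}\n'
        "Hope it helps."
    )


def parse_output(raw):
    # Sanitize: keep only the outermost {...} span, then validate it as JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in agent output")
    data = json.loads(raw[start:end + 1])
    return data["summary"].strip()


def run_graph(subtopics, agent=fake_agent):
    # Fan out: one agent call per subtask, each output parsed on its own.
    partials = [parse_output(agent(f"subtask: {s}")) for s in subtopics]
    # Fan in: a trivial merge node; in practice this could be another agent.
    return " | ".join(partials)
```

Most of the experimentation goes into choosing the subtasks and the output schema; the parsing step is what keeps one malformed answer from poisoning the rest of the graph.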
DSPy [1] seems to target this problem space, but last time I checked it only focused on single-prompt optimization (for instance, by selecting which few-shot examples lead to the best prompt performance). Research on agent graph optimization has been done [2][3][4], but I have yet to find a framework that tackles it.
[2] TextGrad: Automatic 'Differentiation' via Text -- using large language models to backpropagate textual gradients: https://github.com/zou-group/textgrad
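For readers unfamiliar with few-shot selection, a toy sketch of the idea follows: try each combination of candidate demonstrations, score it on a small dev set, and keep the best. This is a brute-force illustration only, not DSPy's API or algorithm; `toy_model` and `select_few_shots` are hypothetical, and the "model" simply answers with the most word-overlapping demo.

```python
from itertools import combinations


def toy_model(demos, question):
    """Hypothetical stand-in for an LLM: copies the answer of the closest demo."""
    def overlap(demo):
        return len(set(demo[0].split()) & set(question.split()))
    return max(demos, key=overlap)[1]


def select_few_shots(pool, dev_set, k, model=toy_model):
    # Exhaustively score every k-sized demo set by dev-set accuracy.
    best_score, best_demos = -1, None
    for demos in combinations(pool, k):
        score = sum(model(demos, q) == a for q, a in dev_set)
        if score > best_score:
            best_score, best_demos = score, demos
    return list(best_demos), best_score


pool = [
    ("capital of france", "paris"),
    ("capital of spain", "madrid"),
    ("color of sky", "blue"),
]
dev_set = [
    ("what is the capital of spain", "madrid"),
    ("what is the color of sky", "blue"),
]
best_demos, best_score = select_few_shots(pool, dev_set, k=2)
```

Doing the same for a whole agent graph means searching over prompts at every node at once, which is why the search space blows up so quickly.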
I've tried explaining this to supposedly smart people in both a 15-minute pitch deck and a research paper, and unless they were inclined to believe it from the start, no amount of proof has managed to convince them.
I figure it's just not possible to convince people, even with the proof in front of them, of how powerful the system is. The same way that we still have people arguing _right now_ that all LLMs are just auto complete on steroids.
> people arguing _right now_ that all LLMs are just auto complete on steroids.
Funny, because when I learned how LLMs work, my immediate thought was "Oh, humans are just LLMs on steroids". So auto complete on steroids squared.