| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Imnimo 684 days ago
	I'm not saying that sampling and majority voting performed worse. I'm saying that multi-agent interaction (labeled Debate and Reflection) performed worse than straightforward approaches that just query multiple times. For example, the Debate method combined with their voting mechanism gets 0.48 GSM8K with Llama2-13B. But majority voting with no multi-agent component (Table 2) gets 0.59 on the same setting. And majority voting with Chain-of-thought (Table 3 CoT/ZS-CoT) does even better. Fundamentally, drawing multiple independent samples from an LLM is not "AI Agents". The only rows on those tables that are AI Agents are Debate and Reflection. They provide marginal improvements on only a few task/model combinations, and do so at a hefty increase in computational cost. In many tasks, they are significantly behind simpler and cheaper alternatives.

2 comments

alsima 684 days ago

I see. Agree with the point about marginal improvements at a hefty increase in computational cost (I touch on this a bit at the end of the blog post where I mention that better performance requires better tooling/base models). Though I would still consider sampling and voting a “multi-agent” framework as it’s still performing an aggregation over multiple results.

link

potatoman22 684 days ago

My take: the agents are bad, the ensembling approach is good.

link

Imnimo 683 days ago

Ensembling is okay, but it doesn't scale great. Like you can put in a bit more compute to draw more samples and get a better result, but you can't keep doing it without hitting a plateau, because eventually the model's majority vote stabilizes. So it's a perfectly reasonable thing to do, as long as we temper our expectations.

link

alsima 684 days ago

Potentially true as well haha

link