| I have the impression that the implied conclusion is that, in the situation described, it would be better to consult several different LLMs than a single one, but that is not what they demonstrate. To demonstrate this, you would have to measure the compute cost of running the models and human-verifying the output. The statistics provided don't at all exclude the possibility that, instead of giving each of the top 5 models a single opportunity to propose a solution, it may be more efficient to give all 5 opportunities to the best-scoring model. At a 24% win rate, the null hypothesis (what a researcher ought to predict based on common sense) would be that the probability of a loss is 76%, the probability that it loses N times in a row is (0.76 ^ N), and so the probability of it winning at least once in N attempts is ( 1 - (0.76 ^ N) ). So for consulting the best-scoring model twice (2x top-1) I would expect 42.24%, better than giving the 2 top-scoring models a single try each (1x top-2), as that resulted in 35%. Same for 3x top-1 vs 1x top-3: 56.10% vs 51%. Same for 4x top-1 vs 1x top-4: 66.64% vs 66%. Same for 5x top-1 vs 1x top-5: 74.64% vs 73%. Same for 6x top-1 vs 1x top-6: 80.73% vs 83%. Same for 7x top-1 vs 1x top-7: 85.35% vs 90%. Same for 8x top-1 vs 1x top-8: 88.87% vs 95%. I can't read the numerical error bars on the top-1 model's win rate; if I could, we could calculate a likelihood from them to see whether the deviation is statistically significant. |
This post measures `1x top-N` (one attempt each from N models), not `Nx top-1` (N attempts from the best-scoring model). We should make that clearer.
Part of why we chose `1x top-N` is that we expect lower error correlation compared to `Nx top-1`, which is also why the iid baseline is likely optimistic.
That said, a direct comparison (`Nx top-1` vs `1x top-N`, with the same review/compute budget) would be useful!
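
To make the two baselines concrete, here is a minimal sketch in Python. It reproduces the iid figures from the quoted comment, `1 - 0.76 ^ N`, next to the `1x top-N` numbers quoted there. It also includes a purely hypothetical correlated model to illustrate why the iid baseline is likely optimistic: it assumes that only a fixed fraction of problems is solvable by the top-1 model at all, with repeated attempts independent within that fraction. The function names and the 0.4 / 0.6 split are illustrative assumptions (chosen so the single-attempt rate is still 24%), not measurements.

```python
# 1x top-N win rates in percent, copied from the quoted comment (not re-measured).
REPORTED_TOP_N = {1: 24, 2: 35, 3: 51, 4: 66, 5: 73, 6: 83, 7: 90, 8: 95}


def iid_baseline(p_single: float, n: int) -> float:
    """P(at least one win in n attempts) if attempts are independent."""
    return 1.0 - (1.0 - p_single) ** n


def correlated_baseline(solvable_fraction: float, p_if_solvable: float, n: int) -> float:
    """Hypothetical model: only `solvable_fraction` of problems can be solved
    by this model at all; on those, each attempt wins with `p_if_solvable`."""
    return solvable_fraction * (1.0 - (1.0 - p_if_solvable) ** n)


def main() -> None:
    p_top1 = REPORTED_TOP_N[1] / 100.0
    # ASSUMPTION: illustrative split with 0.4 * 0.6 == 0.24, not fitted to data.
    solvable_fraction, p_if_solvable = 0.4, 0.6

    print(" N  Nx top-1 (iid)  Nx top-1 (hypothetical corr.)  1x top-N (reported)")
    for n, reported in REPORTED_TOP_N.items():
        iid = 100.0 * iid_baseline(p_top1, n)
        corr = 100.0 * correlated_baseline(solvable_fraction, p_if_solvable, n)
        print(f"{n:2d}  {iid:14.2f}  {corr:29.2f}  {reported:19d}")


if __name__ == "__main__":
    main()
```

Under this hypothetical split, `Nx top-1` can never exceed 40% however many attempts are made, whereas the iid baseline keeps climbing towards 100%. How correlated repeated attempts really are is exactly what the direct comparison would have to measure.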