
Great question. What I can say is we experimented a _ton_. If you take the basic approach of sending the same prompt to a bunch of LLMs and then asking another LLM to combine the results, you'll get a pretty poor answer. At best, you'll get a response that is the average of the ensemble, which by definition is going to be worse than the best model in the ensemble. At worst, you'll regurgitate the worst model in the ensemble. And you'll have the added expense and potential latency, too. Not a good solution at all. Of course, you're also going to want a mechanism to choose the ensemble effectively.

We didn't experiment with different ensemble mechanisms rigorously enough for a research paper. We will, though.

Majority voting was actually how we started, and we came up with great mechanisms for stopping early, saving token costs and time, along with other interesting things we could do with that simple mechanism. The issue we had was that the orchestration could already choose a model beforehand that was almost as good (according to the benchmarks we ran at the time, which were simpler than HLE) as the one majority voting could pick after all the responses were complete. We also tried many other voting mechanisms, such as having every model in the ensemble vote on all the others.
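To make the early-stopping idea concrete, here is a minimal sketch of one way it could work, not our actual implementation. It assumes answers can be compared by exact match (say, multiple-choice letters), that models are queried one at a time, and that each model is just a callable taking a prompt. The name `majority_vote_early_stop` and the stopping rule are hypothetical, picked for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote_early_stop(
    models: List[Callable[[str], str]], prompt: str
) -> Tuple[str, int]:
    """Query models in order and stop once the leader cannot be caught.

    Returns (winning_answer, number_of_models_actually_queried).
    Assumption: answers are directly comparable strings.
    """
    if not models:
        raise ValueError("need at least one model")

    votes: Counter = Counter()
    total = len(models)

    for asked, model in enumerate(models, start=1):
        votes[model(prompt)] += 1
        remaining = total - asked

        ranked = votes.most_common(2)
        leader, lead_count = ranked[0]
        runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

        # Even if every remaining vote went to the runner-up (or to a
        # brand-new answer), the leader would still win: stop here.
        if lead_count > runner_up_count + remaining:
            return leader, asked

    # Only reached on a tie for first place after all models have voted;
    # the answer that was seen first among the tied ones is returned.
    return votes.most_common(1)[0][0], total
```

With five models where the first three agree, this stops after three calls and skips the last two, which is where the token and time savings come from. In practice you would probably fire requests concurrently and cancel the stragglers, but the stopping condition would be the same.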

An ablation study would be great to do now, with many other ideas we've played with. We have better benchmarks than we did just a few months ago, and it would be great to understand the tradeoffs of different approaches so that there could be alternative options for different use cases.