Hacker News new | ask | show | jobs
by dsl 2 hours ago
Heh. I built "Fusion" a few months ago as an MCP using OpenRouter. The idea was to give Claude a "panel of experts" to go talk to when it got stuck.

After extensive testing and benchmarking I discovered that when you ask one model to judge another's response you don't actually get a better answer. You are just asking it "how closely does this resemble the answer you would have given me." Additional rounds and all the "obvious" solutions that pop into your mind reading the proceeding sentence are essentially just cranking up the temperature.

I did find a solution, but it is insanely expensive. Maybe if this gains traction I'll release mine.

4 comments

Prompt matters. Obviously if you want another model opinion you must generate from the scratch using the same prompt and then you can try to synthesize, but working with an existing response can work if desired. I use explicit instructions to find issues with assigned severities and then these are going through the panel of judges, only issues passing certain threshold are fixed in the original response.

I'll share a revelation which vastly improved my results: tell judges to evaluate truth and usefulness/should-be-fixed axis separately. Because inevitably with a prompt that is forcing to find issues you will end up with nitpicks. Plus truth axis allows to better evaluate the issue-finder models for your use case.

That's some part of what happens when I generate explanations like this one: https://hanzirama.com/character/%E6%9D%A5#explain - at this point the site is a small side product of my LLMs-evaluation machinery.

Bonus content for patient readers: if you need top quality you will likely need to pin provider(s) on OR, :exacto is not enough to get good repeatable results especially for open-weights models.

Yeah, same experience. It turned out that objectively better answers were not that easy to find plus the expense plus it’s slow.
I think it depends.

I regularly ask both GPT and Gemini to give me options - programming libraries to do X, architecture suggestions, names for projects/services/classes

After they answer I ask each model what does it think of the other answer, and to give me a final suggestion considering both answers.

Both GPT and Gemini would frequently say "that other answer is much better than my one, it considered X factor that I missed".

But.. but I told the LLM that it is an _expert_, is that worth nothing??
Make sure to remind it to make no mistakes.