by irthomasthomas
15 days ago
Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium.
The context window is relatively small, so to make it work with larger consortiums I can construct a recursive, sort-of meta consortium like this:

  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank
  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank
  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis

Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.
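
For anyone who wants the shape of that two-level setup without the CLI, here is a rough Python sketch. It is not llm-consortium's API: every name in it (rank_consortium, synthesis_consortium, the stub models and the stub arbiter) is hypothetical and only there to show the control flow. The point is that the arbiter never sees more than five candidates at the rank stage, and only two winners at the synthesis stage.

```python
import itertools
from typing import Callable, List

# A "model" here is just a function from prompt to answer (assumption for the sketch).
Model = Callable[[str], str]


def rank_consortium(model: Model, n: int,
                    arbiter_score: Callable[[str, str], float]) -> Model:
    """Sample n answers from one model and keep the one the arbiter scores highest."""
    def run(prompt: str) -> str:
        candidates = [model(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: arbiter_score(prompt, c))
    return run


def synthesis_consortium(members: List[Model],
                         arbiter_synthesize: Callable[[str, List[str]], str]) -> Model:
    """Ask each member (itself possibly a consortium) once, then let the arbiter merge."""
    def run(prompt: str) -> str:
        winners = [member(prompt) for member in members]
        return arbiter_synthesize(prompt, winners)
    return run


# --- Stubs standing in for real models and a real arbiter (all hypothetical) ---
def make_stub(name: str) -> Model:
    counter = itertools.count(1)
    return lambda prompt: f"{name}-answer-{next(counter)}"


def stub_score(prompt: str, candidate: str) -> float:
    # Pretend the arbiter prefers the highest-numbered answer.
    return int(candidate.rsplit("-", 1)[1])


def stub_synthesize(prompt: str, winners: List[str]) -> str:
    return " + ".join(winners)


cns_glm = rank_consortium(make_stub("glm"), 5, stub_score)
cns_kimi = rank_consortium(make_stub("kimi"), 5, stub_score)
cns_meta = synthesis_consortium([cns_glm, cns_kimi], stub_synthesize)

result = cns_meta("some prompt")
print(result)
```

Because a consortium has the same prompt-to-answer signature as a plain model, nesting them costs nothing extra in the sketch, which is the same trick as passing cns-glm and cns-kimi to -m above.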
|
I recently benchmarked how well various models find security vulnerabilities, then did follow-up testing of the judging process: whether the models found the right bug, and whether the other bugs they reported were false positives or legitimate additional bugs. A committee of good-not-great models (DeepSeek, MiMo, Gemma 4) cannot replicate the accuracy of Opus by itself. Even when all three of the other models disagreed with Opus, Opus was almost always the one that was actually right.
It's an interesting area for research. A very fast model can also make many more attempts at a solution, and in cases where there is an unambiguous "right" solution that can be proven by some sort of static rule, "very fast" may be a useful characteristic. Small classification problems, where you need to make thousands of decisions about some specific aspect of a large corpus of data, seem like a sweet spot for a model like Mercury.
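
To make the "many attempts plus a static rule" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: first_verified, the stub model, and the label set are made up, and a real setup would call an actual fast model where the stub is. The static rule here is simply "the output must be one of the allowed labels".

```python
from typing import Callable, Optional, Tuple

ALLOWED_LABELS = {"spam", "ham"}  # hypothetical label set for the example


def is_valid_label(output: str) -> bool:
    """Static rule: accept only an exact, allowed label."""
    return output in ALLOWED_LABELS


def first_verified(generate: Callable[[int], str],
                   verify: Callable[[str], bool],
                   max_attempts: int) -> Optional[Tuple[str, int]]:
    """Retry a cheap generator until the verifier accepts, up to max_attempts.

    Returns (accepted_output, attempt_number), or None if nothing passed.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None


# Stub standing in for a fast model: the first two outputs break the rule.
canned_outputs = ["maybe", "SPAM!!", "spam"]


def stub_model(attempt: int) -> str:
    return canned_outputs[(attempt - 1) % len(canned_outputs)]


outcome = first_verified(stub_model, is_valid_label, max_attempts=5)
print(outcome)
```

The faster the model, the more attempts fit into the same latency budget, which only helps when the verifier is cheap and unambiguous, as it is here.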