by lolinder
520 days ago
Honestly, the fact that you used an LLM to grade the answers at all is enough to make me discount your results entirely. That it showed an obvious preference for the model with which it shares weights is just a symptom of the core problem, which is that you had to pick a model to trust before you even ran the benchmarks. The only judges that matter at this stage are humans. Maybe someday, when we have models that humans agree are reliably good, you could use them to judge lesser-but-cheaper models.