|
|
|
|
|
by KTibow
521 days ago
|
|
Interesting eval but my first reaction is "using Mixtral as a judge doesn't sound like a good idea". Have you tested how different its results are from GPT-4 as a judge (on a small scale) or how stuff like style and order can affect its judgements? Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper |
|
We would have liked to pick a neutral model like Gemini which was fast, reliable and low cost, unfortunately it gave too many poor answers good grades [1]. If we had to pick a new grading model now, hopefully the much improved Gemini Flash 2.0 might yield better results.
[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...