by chr15m 137 days ago
Ah interesting. Thank you very much for sharing the illuminating results.

One question I had - was the judgement blinded? Did judges know which models produced which output?

It was not. The agent ID is not shown overtly, but it can be found via the workspace filepath.

But that is a good point. Perhaps it should be mapped to something unidentifiable.
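
For what it's worth, a minimal sketch of that mapping, assuming the outputs are available as a dict keyed by agent ID. The function name `blind_outputs` and the `candidate-N` labels are made up for illustration and are not from the actual harness. Anything else that leaks the ID, such as the workspace filepath, would need to be renamed with the same label.

```python
import random


def blind_outputs(outputs, seed=None):
    """Swap agent IDs for opaque labels and shuffle presentation order.

    outputs: dict mapping agent ID -> output text (assumed shape).
    Returns (blinded, key):
      blinded is a list of (label, text) pairs to show the judge,
      key maps label -> agent ID and stays hidden until scoring is done.
    """
    rng = random.Random(seed)
    agent_ids = list(outputs)
    # Shuffle so label order carries no hint about which agent is which.
    rng.shuffle(agent_ids)

    blinded = []
    key = {}
    for i, agent_id in enumerate(agent_ids):
        label = f"candidate-{i + 1}"
        key[label] = agent_id
        blinded.append((label, outputs[agent_id]))
    return blinded, key


if __name__ == "__main__":
    # Hypothetical agent IDs and outputs, for demonstration only.
    sample = {"agent-x": "first answer", "agent-y": "second answer"}
    blinded, key = blind_outputs(sample)
    for label, text in blinded:
        print(label, "->", text)
```

The judge only ever sees `blinded`; `key` is used afterwards to map scores back to agents.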

Ah ok. If you do run it again, that would be a worthwhile change. I know I personally have biases about models, and I have seen others say the same, so it seems likely it would skew the results at least a little.

Nonetheless you've convinced me to try an even wider variety of models, thanks!

In fact, this makes me think I should add this as a feature to my AI dev tooling - compare responses side by side and pick the best one.