Hacker News
by ImageXav 316 days ago
Yes, especially as models are known to have a preference towards outputs of models in the same family. I suspect this leaderboard would change dramatically with different models as the judge.

I don't care about either method. The ground truth should be what a human would do, not what a model does.
There may be different or better solutions for almost all of those kinds of tasks. I wouldn't be surprised if the optimal answer to some of them were to refuse or defer the request: refactor first, then solve it properly.
That response is quite in line with the typical human PR response to a first draft.

There is a possibility that machine-based PR reviews are better, for instance because they are not prejudiced by who initiated the PR and because they don't take other environmental factors into account. You'd expect a machine to be more neutral, so on that front the machine should, and possibly could, score better. But until the models consistently outperform humans on impartially scored quality against a baseline of human results, it is the humans who should make the call, not the machines.

I wouldn't necessarily expect a machine to be more neutral. Machines can easily be biased too.
On something like a PR review I would. But for anything involving private information about the subject, such as background, gender, photographs and/or video, or their other writings, I think you'd be right.

It's just that it is fairly trivial to present a PR to a machine in such a way that it can only comment on the differences in the code. I would find it surprising if that somehow led to a bias about the author. Can you give an example of how you think that would creep into such an interaction?
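
To make "only the differences in the code" concrete, here is a minimal sketch, assuming the PR arrives as a single-commit patch in the style of `git format-patch`. The function name `diff_only` and the sample patch are hypothetical, made up for illustration. It drops everything before the first `diff --git` line (author, date, subject, commit message) and keeps the rest.

```python
def diff_only(patch: str) -> str:
    """Return the patch from its first 'diff --git' line onward.

    Assumes a single-commit patch. Everything before the first file diff
    (From/Date/Subject headers, commit message) is discarded, so the
    reviewer sees no author metadata. Names or identifying remarks inside
    the changed code itself are NOT removed.
    """
    lines = patch.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("diff --git "):
            return "\n".join(lines[i:])
    return ""  # no file diff found


# Hypothetical sample input, not taken from any real repository.
SAMPLE_PATCH = """\
From: Alice Example <alice@example.com>
Date: Mon, 1 Jan 2024 10:00:00 +0000
Subject: [PATCH] Add greeting

Quick fix, sorry for the mess.
---
diff --git a/hello.py b/hello.py
--- a/hello.py
+++ b/hello.py
@@ -1 +1,2 @@
 import sys
+print("hello")
"""

anonymized = diff_only(SAMPLE_PATCH)
```

Whatever is left in `anonymized` is what would be handed to the reviewing model. This only covers metadata; it says nothing about bias the model might pick up from coding style or comments in the diff.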

They are different models already but yes, I already let ChatGPT judge Claude's work for the same reason.