| HN Mirror

Hacker News


	by ImageXav 316 days ago
	Yes, especially as models are known to have a preference towards outputs of models in the same family. I suspect this leaderboard would change dramatically with different models as the judge.

2 comments

jacquesm 316 days ago

I don't care about either method. The ground truth should be what a human would do, not what a model does.


mirekrusin 316 days ago

There may be different or better solutions for almost all of those kinds of tasks. I wouldn't be surprised if the optimal answer to some of them were to refuse or defer the ask, refactor first, and then solve it properly.


jacquesm 316 days ago

That response is quite in line with the typical human PR response to a first draft.

There is a possibility that machine-based PR reviews are better, for instance because they are not prejudiced by who initiated the PR and because they don't take other environmental factors into account. You'd expect a machine to be more neutral, so on that front the machine should, and possibly could, score better. But until the models consistently outperform humans on impartially scored quality against a baseline of human results, it is the humans who should call this, not the machines.


jeltz 316 days ago

I wouldn't necessarily expect a machine to be more neutral. Machines can easily be biased too.


jacquesm 316 days ago

On something like a PR review I would. But on anything that would involve private information such as the background, gender, photographs and/or video as well as other writings by the subject I think you'd be right.

It's just that it is fairly trivial to present a PR to a machine in such a way that it can only comment on the differences in the code. I would find it surprising if that somehow led to a bias about the author. Can you give an example of how you think that would creep into such an interaction?
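A minimal sketch of what that kind of presentation could look like, assuming the PR arrives as `git format-patch` style text. `anonymize_patch` is a hypothetical helper made up for illustration, not part of any tool mentioned here, and it only drops the headers and commit message; names that appear inside the changed code itself (comments, copyright lines) would still get through.

```python
import re

# Matches the mbox separator line that starts each commit in
# `git format-patch` output, e.g. "From <40 hex chars> Mon Sep 17 ...".
MBOX_SEPARATOR = re.compile(r"From [0-9a-f]{40} ")


def anonymize_patch(patch_text: str) -> str:
    """Return only the diff sections of a patch.

    Everything before the first "diff --git" line of each commit
    (author, date, subject, sign-offs, diffstat) is dropped, so a
    reviewer sees the code changes and nothing about who made them.
    """
    kept = []
    in_diff = False
    for line in patch_text.splitlines():
        if MBOX_SEPARATOR.match(line):
            # A new commit header begins: stop copying until its diff starts.
            in_diff = False
        elif line.startswith("diff --git "):
            in_diff = True
        if in_diff:
            kept.append(line)
    return "\n".join(kept)
```

The output of `anonymize_patch(patch_text)` is what would be handed to the reviewing model in place of the full PR.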


spiderfarmer 316 days ago

They are different models already but yes, I already let ChatGPT judge Claude's work for the same reason.
