| HN Mirror


by cj 110 days ago

One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

2 comments

hex4def6 110 days ago

If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
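
One way to make "trust your measurement tool" concrete is to check how well the grader's ratings agree with human ratings on the same responses. Below is a rough sketch using Cohen's kappa (chance-corrected agreement). The labels are made up for illustration and `cohens_kappa` is a hypothetical helper, not anything taken from the quoted evaluation.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two lists of categorical labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels: 1 = "harmful difference", 0 = "no harmful difference".
human_labels = [1, 0, 0, 1, 0, 1, 0, 0]
judge_labels = [1, 0, 0, 1, 0, 0, 0, 0]

kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.3f}")
```

A grader that has been through this kind of check against human labels is a known quantity; a newer model that hasn't is not, however capable it may be.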


titanomachy 110 days ago

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…
