by cj
110 days ago
One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?
Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
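
For a rough sense of what "consistent with human ratings" could mean in practice, here is a minimal sketch that scores agreement between a human rater and a model judge with Cohen's kappa. The quoted text doesn't say which agreement metric was actually used, so kappa is my assumption, and the `human` and `judge` labels below are made up purely for illustration (1 = flagged as a harmful difference, 0 = not flagged).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels, not taken from any real evaluation.
human = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
judge = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

The point of correcting for chance is that raw agreement looks inflated when one label dominates. A judge that has been characterized this way on a fixed human-labeled set is a known quantity, which is much harder to say about a model released last week.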