HN Mirror



	by avereveard 998 days ago
	You can ask GPT-4 or another high-end model to rate two chat logs for coherency, etc. It's not as accurate as human evaluation, but you don't have to read thousands of lines of text when comparing many models.
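A minimal sketch of that pairwise rating loop, under stated assumptions: the prompt wording, the A/B reply format, and the names `build_prompt`, `parse_verdict`, and `compare_models` are all hypothetical, and `judge` is any callable you supply that sends a prompt to the evaluator model and returns its text reply. No real API is called here.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical prompt wording; adjust the criteria to whatever you want rated.
PROMPT_TEMPLATE = (
    "Rate the two chat logs below for coherency.\n"
    "Reply with exactly one letter: A or B.\n\n"
    "--- Chat log A ---\n{log_a}\n\n"
    "--- Chat log B ---\n{log_b}\n"
)


def build_prompt(log_a: str, log_b: str) -> str:
    """Fill the comparison prompt with the two chat logs."""
    return PROMPT_TEMPLATE.format(log_a=log_a, log_b=log_b)


def parse_verdict(reply: str) -> str:
    """Map the judge's reply to 'A', 'B', or 'invalid'."""
    text = reply.strip().upper().rstrip(".")
    return text if text in ("A", "B") else "invalid"


def compare_models(
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str], str],
) -> Counter:
    """Tally judge verdicts over (log_a, log_b) pairs.

    `judge` is an assumed user-supplied function: prompt in, reply text out.
    """
    counts: Counter = Counter()
    for log_a, log_b in pairs:
        reply = judge(build_prompt(log_a, log_b))
        counts[parse_verdict(reply)] += 1
    return counts
```

With many pairs, the share of "A" versus "B" verdicts gives a rough preference signal without reading every log; the "invalid" count shows how often the judge ignored the reply format.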

1 comment

brucethemoose2 998 days ago

This is problematic if you are comparing a model in the same base family as the evaluator. The evaluator will probably favor that model, because its outputs are literally the sequences the evaluator itself would naturally emit.
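One way to act on this caveat is to flag, before trusting any scores, which candidates share a base family with the evaluator. The sketch below is only an illustration: the function name `split_by_judge_family` and the family labels are hypothetical, and you would have to supply the model-to-family mapping yourself.

```python
from typing import Dict, List, Tuple


def split_by_judge_family(
    judge_family: str,
    candidate_families: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Split candidates into (unrelated, same_family) relative to the judge.

    `candidate_families` maps a model name to a user-supplied base-family
    label. Labels are compared case-insensitively.
    """
    judge_key = judge_family.strip().lower()
    unrelated: List[str] = []
    same_family: List[str] = []
    for name, family in candidate_families.items():
        if family.strip().lower() == judge_key:
            same_family.append(name)  # scores here may be inflated
        else:
            unrelated.append(name)
    return unrelated, same_family
```

Scores for anything in the second list are better checked by a human or by an evaluator from a different family.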
