Hacker News
by jyu 1068 days ago
For text, is there a standard way to compare model results?
1 comment

People have come up with tons of metrics; for example, look at the Hugging Face leaderboard. There are also more niche leaderboards and tests for chat models, chain of thought, summarization, and such.

But the best test is personal experimentation. Prompt engineering and subjective preference have a massive effect on how well a finetune performs for your use case.
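
To make the "metrics" part a bit more concrete, here is a minimal sketch of one of the simplest automatic scores: token-overlap F1 between a model's output and a reference answer. It is not what any particular leaderboard uses. The function name `token_f1`, the model names, and the example strings are all made up for illustration, and real benchmarks do more careful normalization and tokenization.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (hypothetical helper).

    Assumes lowercase + whitespace splitting is good enough; real
    evals usually also strip punctuation and normalize further.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up outputs from two hypothetical models for the same prompt.
reference = "the cat sat on the mat"
outputs = {
    "model_a": "the cat sat on the mat",
    "model_b": "a cat is on a mat",
}
scores = {name: token_f1(text, reference) for name, text in outputs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A score like this is cheap to run over a lot of prompts, but it only measures word overlap, which is part of why reading the outputs yourself still matters.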