There are tons of metrics and benchmarks people have come up with; the Hugging Face leaderboard is one example. There are also more niche leaderboards and tests for chat models, chain of thought, summarization, and so on.
But the best test is personal experimentation. Prompt engineering and subjective preference have a massive effect on how well a finetune performs for your use case.
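If you want to make that experimentation a bit more systematic, here is a minimal sketch of a side-by-side comparison harness. Everything in it is an assumption for illustration: `compare_outputs`, the template names, and the two stand-in "models" are hypothetical, and each model is just a function that takes a prompt string and returns text. In practice you would swap in whatever calls your actual finetunes.

```python
from itertools import product
from typing import Callable, Dict, List


def compare_outputs(
    models: Dict[str, Callable[[str], str]],
    templates: Dict[str, str],
    questions: List[str],
) -> List[dict]:
    """Run every question through every prompt template and every model.

    Each template must contain a "{question}" placeholder.
    Returns one row per (question, template, model) combination.
    """
    rows = []
    for question, (tname, template), (mname, generate) in product(
        questions, templates.items(), models.items()
    ):
        prompt = template.format(question=question)
        rows.append(
            {
                "question": question,
                "template": tname,
                "model": mname,
                "output": generate(prompt),
            }
        )
    return rows


# Hypothetical stand-ins for real models: replace with calls to your finetunes.
models = {
    "model_a": lambda prompt: "A says: " + prompt.upper(),
    "model_b": lambda prompt: "B says: " + prompt.lower(),
}

# Hypothetical prompt formats: the same question can behave very differently
# depending on the template a finetune was trained with.
templates = {
    "plain": "{question}",
    "instruct": "### Instruction:\n{question}\n### Response:\n",
}

questions = ["Summarize why the sky is blue.", "Write a haiku about rain."]

rows = compare_outputs(models, templates, questions)
for row in rows:
    print(f"[{row['model']} | {row['template']}] {row['question']}")
    print(row["output"])
    print("-" * 40)
```

Reading the outputs next to each other, with your own prompts and your own judgment of what counts as a good answer, is the point; the harness only saves you from copy-pasting.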