Hacker News
by Uehreka 1157 days ago
Do we have any quantitative way of benchmarking the quality of these models at all? Like, I don’t care if a model takes one minute per token on my laptop as long as it’s “GPT-4 quality”, and I don’t care if it does 100 tokens per second if it’s straight crap. But every comparison I see people make regarding quality seems to come from “I asked it a couple of my favorite questions and it did… uh, only a little worse than GPT imo”
1 comment

Perplexity scores are pretty common. As I understand it, you take a text corpus like Wikipedia and measure how well the model predicts each next word (token) of it. Lower perplexity means the model was less "surprised" by the text.
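
To make that concrete, here is a rough sketch of the arithmetic: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. The function names and the tiny unigram "model" below are made up for illustration. A real evaluation would get the per-token log-probabilities from the LLM itself and use a much larger held-out corpus.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def unigram_log_probs(train_tokens, eval_tokens):
    """Toy stand-in for a language model (assumption, not a real LLM):
    add-one smoothed unigram frequencies, ignoring context entirely."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(eval_tokens)
    total = len(train_tokens) + len(vocab)
    return [math.log((counts[t] + 1) / total) for t in eval_tokens]


# Hypothetical miniature "corpus", split into training and held-out text.
train = "the cat sat on the mat".split()
held_out = "the cat sat on the rug".split()

ppl = perplexity(unigram_log_probs(train, held_out))

# Sanity check: a model that gives every token probability 1/4
# is as uncertain as a fair four-way guess.
uniform = perplexity([math.log(1 / 4)] * 10)

print(f"toy unigram perplexity: {ppl:.2f}")
print(f"uniform 1-in-4 perplexity: {uniform:.2f}")
```

One way to read the number: a perplexity of 4 means the model was, on average, about as unsure as if it were picking uniformly among four tokens. Scores are only comparable between models that use the same evaluation text and the same tokenization.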