HN Mirror



	by Uehreka 1205 days ago
	Do we have any quantitative way of benchmarking the quality of these models at all? Like, I don’t care if a model takes one minute per token on my laptop as long as it’s “GPT-4 quality”, and I don’t care if it does 100 tokens per second if it’s straight crap. But every comparison I see people make regarding quality seems to come from “I asked it a couple of my favorite questions and it did… uh, only a little worse than GPT imo”

1 comment

QuadrupleA 1205 days ago

Perplexity scores are pretty common. I think computing one involves taking a text corpus like Wikipedia and seeing how well the model predicts each next word (token) of it.
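
For what it's worth, here is a rough sketch of the arithmetic, assuming you already have the probability the model assigned to each actual next token in the corpus. Perplexity is the exponential of the average negative log-probability per token, so lower is better. The `perplexity` helper and the probability lists are made up for illustration, not output from any real model.

```python
import math

def perplexity(token_probs):
    """Return exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities assigned to the true next token at each position.
confident = [0.9, 0.8, 0.95, 0.85]     # model usually expects the right token
uncertain = [0.25, 0.25, 0.25, 0.25]   # like guessing uniformly among 4 options

print(perplexity(confident))
print(perplexity(uncertain))
```

A model that assigns probability 1/4 to every correct token scores a perplexity of 4, which is why the number is often read as "how many options the model is effectively choosing between." Scores are only comparable when computed on the same corpus with the same tokenization.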
