| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ankitmathur 1206 days ago
	Out of curiosity: what's an example of a metric that you would use to evaluate the ability of the model? For example, just looking qualitatively, asking a prompt like "How do I tie a tie?" to Pythia produces content that isn't even reasonably responding to that. And yet many benchmarks have no problem with that