HN Mirror



	by azinman2 864 days ago
	There’s lots to evaluate. If you’re evaluating model quality, there are many benchmarks, all trying to measure different things: accuracy in translation, common-sense reasoning, how well it stays on topic, whether it can regurgitate a reference from the prompt text, how biased the output is along a societal dimension, other safety measures, etc. I’m in the field but not an LLM researcher per se, so perhaps this is more meaningful to others, but given the post it seems useful to answer my question, which was: what _exactly_ is being evaluated? In particular, this is only working off the encoded sentences, so it seems to me that things that involve attention etc. aren’t being evaluated here.