| HN Mirror



	by ricardobeat 443 days ago
	> But even that evaluation is hard. LLM could hallucinate. I would say most time it works, but there are always few runs that failed to deliver

	You can use the success rate (%) over N runs on a set of problems, which gives you a number you can compare against other systems. A separate model does the evaluation. Existing frameworks like DeepEval facilitate this.
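
To make the "success rate over N runs" idea concrete, here is a rough sketch in plain Python. It does not use DeepEval's API; `success_rate`, `flaky_system`, `exact_match_judge`, and the `EXPECTED` answers are all hypothetical names made up for illustration. The judge here is a simple exact-match check standing in for the separate evaluator model.

```python
import random
from typing import Callable, Dict, List, Tuple


def success_rate(
    problems: List[str],
    run_system: Callable[[str], str],
    judge: Callable[[str, str], bool],
    n_runs: int = 10,
) -> Tuple[Dict[str, float], float]:
    """Run each problem n_runs times and return (per-problem %, overall %)."""
    if not problems or n_runs <= 0:
        raise ValueError("need at least one problem and one run")
    per_problem: Dict[str, float] = {}
    total_passes = 0
    for problem in problems:
        # Each run is judged independently; True counts as 1, False as 0.
        passes = sum(judge(problem, run_system(problem)) for _ in range(n_runs))
        per_problem[problem] = 100.0 * passes / n_runs
        total_passes += passes
    overall = 100.0 * total_passes / (len(problems) * n_runs)
    return per_problem, overall


# Hypothetical stand-ins: a real setup would call the system under test
# and a separate judge model instead of these toy functions.
EXPECTED = {"2+2": "4", "3*3": "9"}
_rng = random.Random(0)


def flaky_system(problem: str) -> str:
    # Simulates a system that answers correctly roughly 80% of the time.
    return EXPECTED[problem] if _rng.random() < 0.8 else "?"


def exact_match_judge(problem: str, output: str) -> bool:
    # Placeholder for the evaluator; here it only compares to a known answer.
    return output == EXPECTED[problem]


if __name__ == "__main__":
    per_problem, overall = success_rate(
        list(EXPECTED), flaky_system, exact_match_judge, n_runs=50
    )
    print(per_problem, f"overall: {overall:.1f}%")
```

Running the same problem set with the same N against two systems gives two percentages that can be compared directly, though with small N the difference may just be noise.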