| HN Mirror

	by mdp2021 467 days ago
	Can I just wholeheartedly congratulate you on having found a critical benchmark for evaluating LLMs? Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.