| HN Mirror



	by d_legs 475 days ago
	Hi HN - I decided to compare LLMs by having them play Tic-Tac-Toe. The results were surprisingly bad, considering all the talk about how LLMs have "saturated" benchmarks. Have you run into any tasks where LLMs are way worse than you'd expect?