Hacker News
by d_legs 475 days ago
Hi HN - I decided to compare LLMs by having them play Tic-Tac-Toe. The results were surprisingly bad, considering all the talk about how LLMs have "saturated" benchmarks. Have you run into any tasks where LLMs are way worse than you'd expect?