by d_legs
475 days ago
Hi HN - I decided to compare LLMs by having them play Tic-Tac-Toe. The results were surprisingly bad, considering all the talk about how LLMs have "saturated" benchmarks. Have you run into any tasks where LLMs are way worse than you'd expect?