Hacker News
LLMs are really bad at Tic-Tac-Toe (gensx.com)
4 points by d_legs 475 days ago
3 comments

It seems to me that there's a fundamental difference between how you would train an AI to provide a reasonable, coherent text-based or language-based response as part of a conversation and how you would train it to optimize for a specific set of rules or goals. If you asked an LLM to formulate and explain a tic-tac-toe strategy, I expect it would respond with something robust.
Agreed. I tried asking the models to outline a strategy and they can produce a decent output, although not as robust as I expected. I'm sure you could fine-tune an LLM to be good at Tic-Tac-Toe too, but the surprising thing to me was how LLMs, even top ones like gpt-4.5, don't generalize well enough to be half decent at a simple game.
That's kind of the thing, though, isn't it? It can explain a robust strategy, but it can't play it. That's really different from human intelligence, to the point where it seems reasonable to claim that the LLM doesn't really understand what it's saying.
I made something that played noughts and crosses (same thing) perfectly. Knowing what responses you will make, and taking into account reflections and rotations, I had a state machine with barely more than 100 states. It really doesn't need AI, but an LLM is a language model, for goodness' sake; give it a job it can do.
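For anyone curious what the reflections-and-rotations trick looks like, here is a minimal Python sketch. It is not the state machine described above. It swaps in plain minimax with memoization, and every name in it (`canonical`, `value`, `best_move`, the nine-character board string) is made up for illustration. The idea it shares with the comment is that boards differing only by a rotation or mirror are treated as one state.

```python
from functools import lru_cache

# Cells are numbered 0..8 row-major; a board is a 9-char string of "X", "O", ".".
# Each tuple says which old cell lands in each new cell (assumed encoding).
ROT90 = (6, 3, 0, 7, 4, 1, 8, 5, 2)    # rotate 90 degrees clockwise
MIRROR = (2, 1, 0, 5, 4, 3, 8, 7, 6)   # flip left-to-right

LINES = ((0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6))

def transform(board, perm):
    return "".join(board[p] for p in perm)

def symmetries(board):
    """All 8 rotations/reflections of a board (duplicates possible)."""
    out = []
    for _ in range(4):
        out.append(board)
        out.append(transform(board, MIRROR))
        board = transform(board, ROT90)
    return out

def canonical(board):
    """Pick one representative per symmetry class."""
    return min(symmetries(board))

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play(board, cell, player):
    return board[:cell] + player + board[cell + 1:]

@lru_cache(maxsize=None)
def value(board, player):
    """Minimax value from X's side: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    if "." not in board:
        return 0
    other = "O" if player == "X" else "X"
    # Canonicalize children so symmetric positions share one cache entry.
    results = [value(canonical(play(board, i, player)), other)
               for i, c in enumerate(board) if c == "."]
    return max(results) if player == "X" else min(results)

def best_move(board, player):
    """Return the index of an optimal move for `player`."""
    other = "O" if player == "X" else "X"
    moves = [i for i, c in enumerate(board) if c == "."]
    score = lambda i: value(canonical(play(board, i, player)), other)
    return max(moves, key=score) if player == "X" else min(moves, key=score)
```

Calling `value(canonical("........."), "X")` gives the game's value with perfect play from both sides, and `best_move("OO..X...X", "X")` picks X's reply in a position where O threatens the top row. A hand-built state machine with fixed replies needs far fewer states than this full search visits, since it only stores positions its own strategy can reach.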
LLMs are supposed to be a step on the route to a general AI. This is a data point suggesting that, if it can't do this simple job, it may never be suited to a wide range of jobs that any human can do trivially. That's a serious crimp in its reputation as an AGI.

There may be a solution to that at some point. But right now it suggests we're more on the stochastic-parrot track.

Totally agree that there are better ways to build a program to beat tic tac toe. I'd expect an LLM could probably write the code itself with a few turns of the crank (it wrote most of the code I used to test this). The point here is to test the G in AGI: how well models can generalize to new tasks.
My quibble is with the use of language models. Human intelligence is not predicated on the written word. You wouldn't use an LLM to create something that plays tennis, yet our human general intelligence covers that.
Hi HN - I decided to try to compare LLMs by having them play Tic-Tac-Toe. The results were surprisingly bad considering all the talk about how LLMs have "saturated" benchmarks. Have you run into any tasks where LLMs are way worse than you'd expect?