by xnorswap 890 days ago
This game is well known in the UK as the "Connecting Wall" from Only Connect.

This result - poor Chat GPT performance - surprises me. I thought pattern detection and set forming was something that Chat GPT could do well. Perhaps it would need a model trained specifically for this task. If AlphaZero can master chess, then surely this game isn't beyond what is trainable.

You can prompt Chat GPT that it'll be playing the connecting wall without having to explain the game. It still fails to make a good set of connections when provided the wall.

One interesting part of the "Connecting Wall" sets is that there is almost always a "wordy" one involving changing a letter, adding a prefix, anagrams, etc., and almost always a "person" one. For example, there'll be a set of "Famous people named Tom..." but not a set of "Toms" alongside a set of "Margarets". The remaining sets are a couple of general ones.

This is a huge help given the 2 minutes and 30 seconds provided.

On another note, it's possible that the GCHQ puzzle book would be in the training set. It has many puzzles with solutions for this format and a very similar rubric, with 55 items and sets of sizes 1 through 10. That said, Chat GPT perhaps would not tie the answers in the back of the book to the puzzles in the front.

All in all, I think an AI trained for this purpose with problems and given solutions ought to end up mastering this format. But a general-purpose Chat GPT seems to perform very badly.

6 comments

> This result - poor Chat GPT performance - surprises me. I thought pattern detection and set forming was something that Chat GPT could do well

I would speculate it’s struggling because of the linear nature of its output, and the red-herring words which cross over between categories.

Because the model can’t “look ahead”, it starts spitting out valid combinations, but without being able to anticipate that committing to a certain combination early on will lead to a mistake later.
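
A toy sketch of the failure mode described above, in plain Python. It says nothing about how a transformer works internally; it only contrasts "commit to the first valid group" with a search that can undo an early choice. The board, the `CATEGORIES` table and every function name here are made up for illustration, with "bass" playing the red herring.

```python
from itertools import combinations

# Hypothetical toy board: "bass" fits both categories (the red herring).
CATEGORIES = {
    "music": {"bass", "drum", "guitar", "piano", "flute"},
    "fish": {"bass", "trout", "salmon", "cod"},
}
BOARD = ["bass", "drum", "guitar", "piano", "flute", "trout", "salmon", "cod"]


def valid_group(group):
    """A group is valid if all of its words belong to one category."""
    return any(set(group) <= members for members in CATEGORIES.values())


def greedy(board, size=4):
    """Commit to the first valid group found and never revisit it."""
    remaining, groups = list(board), []
    while remaining:
        pick = next(
            (g for g in combinations(remaining, size) if valid_group(g)), None
        )
        if pick is None:
            return groups, remaining  # stuck: leftovers form no valid group
        groups.append(pick)
        remaining = [w for w in remaining if w not in pick]
    return groups, remaining


def backtrack(board, size=4):
    """Same choices, but undo a group when the rest cannot be completed."""
    if not board:
        return []
    first, rest = board[0], board[1:]
    for mates in combinations(rest, size - 1):
        group = (first,) + mates
        if valid_group(group):
            tail = backtrack([w for w in rest if w not in mates], size)
            if tail is not None:
                return [group] + tail
    return None
```

On this board `greedy(BOARD)` grabs "bass" for the music group and is left with four words that form no valid set, while `backtrack(BOARD)` eventually moves "bass" to the fish group and finishes.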

I expect if you asked it to correct its output in a followup message, it could do so without much difficulty.

> I expect if you asked it to correct its output in a followup message, it could do so without much difficulty.

I had a similar idea to the author and tried this many times, albeit with the free version of ChatGPT. After getting wrong results, I prompted it to correct them, even telling the model explicitly that a category is wrong or doesn't make sense. Nothing I did made a difference.

My two cents on why this doesn't work: the answer has to consist of the discrete set of words given in the prompt, and importantly, none of them should be duplicated. I suspect that these current models are not very good at following the instruction "the token should appear in the answer exactly once".
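
The "exactly once" constraint above is at least easy to check mechanically after the fact. A minimal sketch, assuming the model's answer has already been parsed into lists of words; `check_partition` and its parameters are hypothetical names, not part of any tool mentioned in the thread.

```python
from collections import Counter


def check_partition(board, groups, group_size=4):
    """Return a list of problems; an empty list means every board word
    is used exactly once and every group has the expected size."""
    problems = []
    used = Counter(word for group in groups for word in group)

    for i, group in enumerate(groups):
        if len(group) != group_size:
            problems.append(
                f"group {i} has {len(group)} words, expected {group_size}"
            )

    for word, count in used.items():
        if word not in board:
            problems.append(f"{word!r} is not on the board")
        elif count > 1:
            problems.append(f"{word!r} appears {count} times")

    for word in board:
        if word not in used:
            problems.append(f"{word!r} is missing")

    return problems
```

The returned list could be pasted back into a follow-up prompt, although the experience described above suggests that this alone may not fix the answer.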

> Because the model can’t “look ahead”, it starts spitting out valid combinations, but without being able to anticipate that committing to a certain combination early on will lead to a mistake later.

Aren't there already models that CAN look ahead? Or are there none?

Not sure how AlphaZero is relevant to whether a transformer can play Connections. AlphaZero is not a transformer and chess is not Connections.
Neophyte question:

Can we infer anything about what LLMs can achieve from what we can achieve with AIs like AlphaGo? I thought their approaches were completely separate.

Not really.

GPTs are a class of text predictors. Ultimately they are ranked on whether or not the output is similar to the training data, text-wise. If the training data included a game then a GPT may be able to play that game, but only if the game requires reasoning about entire words (because of tokenization, GPTs have a hard time reasoning in terms of letters, which is why they do poorly at crosswords, for example).

On the flip side, AlphaZero is a class of networks that have a list of actions they can take, and a list of parameters they observe about the game (in chess: the board position; in other games: their position on screen, score, speed, etc.). The model is then trained to take actions that maximize an actual hard value from the game, like winning a game of chess, capturing a piece, increasing a score, or driving the furthest.

In theory you could train a model with the AlphaGo method to do text prediction, but LLMs are called "large" for a reason: the input and output spaces would have to be as large as the number of possible tokens (and at that point just train a normal GPT, it's much more efficient). Also in theory you could train a GPT to play games, but you're spending huge amounts of compute evaluating extraneous words in the input (the prompt) and the output (most words do not have anything to do with your game). On top of that, you're iterating over every word you generate to generate the next one, so you're doing multiple passes of this largely inefficient computing, which means you're slower compared to a tailor-made model that can evaluate one situation once and give you a list of outputs to perform.

In this specific case it's a bit weird because the input space for the AlphaZero model would have to be every word that can appear on the board, but the reasoning part is most likely not a problem given enough model size. Since it's competing with a multi-gigabyte LLM though, there is space to spare.

I've certainly thought about testing LLMs on Connections and I'm glad someone has. It might be possible to increase their performance, but LLMs as-is are not suited for the task.

The problem is that Connections is ultimately a search problem that requires more than simply grouping similar words. There are lots of combinations to assess. I bet if you enumerate, score, then rank all possible groupings, an LLM would perform much better.
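
A rough sketch of the "enumerate, score, then rank" idea above. For 16 words in four groups of four there are 16! / (4!^4 · 4!) = 2,627,625 possible groupings, so brute-force enumeration is feasible. The scoring function is the hard part and is only a stand-in here: `TAGS` and `score_group` are invented for the example, and in practice the score would have to come from somewhere else (an LLM, embeddings, a word list). Words are assumed to be distinct.

```python
import heapq
from itertools import combinations


def partitions(words, size=4):
    """Yield every way to split `words` into unordered groups of `size`."""
    if not words:
        yield []
        return
    first, rest = words[0], words[1:]
    # Pinning the first remaining word to the current group avoids
    # producing the same partition again with its groups reordered.
    for mates in combinations(rest, size - 1):
        group = (first,) + mates
        remaining = [w for w in rest if w not in mates]
        for tail in partitions(remaining, size):
            yield [group] + tail


def rank_partitions(words, score_group, size=4, top=3):
    """Score each full grouping as the sum of its group scores; keep the best."""
    scored = (
        (sum(score_group(g) for g in p), p) for p in partitions(words, size)
    )
    return heapq.nlargest(top, scored, key=lambda pair: pair[0])


# Hypothetical stand-in scorer: each word carries hand-written tags, and a
# group scores the size of its largest subset sharing one tag.
TAGS = {
    "bass": {"fish", "music"}, "trout": {"fish"}, "salmon": {"fish"},
    "cod": {"fish"}, "drum": {"music"}, "guitar": {"music"},
    "piano": {"music"}, "flute": {"music"},
}


def score_group(group):
    all_tags = set().union(*(TAGS[w] for w in group))
    return max(sum(tag in TAGS[w] for w in group) for tag in all_tags)


WORDS = list(TAGS)
best_score, best_groups = rank_partitions(WORDS, score_group, top=1)[0]
```

Because whole groupings are scored rather than single groups, the red herring "bass" ends up with the fish in this toy example: taking it for the music group would leave the fish group one word short and lower the total.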

ChatGPT4 solved today's riddle on the first try for me. Caution, spoilers ahead: https://chat.openai.com/share/0c40a0b5-ab8f-4094-a7cc-21bb94...

(it even ignored some embarrassing typos ...)

Doesn't this list the words in the order that they are grouped? The article states that randomizing the words completely eliminates any successful results.
It didn't solve it -- instead it simply created groups in the exact order you provided.
Apart from the "it just explained the already ordered groups in the question" problem, it didn't even explain one of the groups correctly. "Something about coat(ing) and food" is not the correct explanation, it's missing a lateral logic step there to go from food-related to a separate meaning.
Chess has a well defined set of correct solutions. The rules are well known and understood.

Connections is much less so.