So maybe I think about things a little differently, but is there a theoretical reason why we should expect a large language model to be good at sudokus? I remember that not long ago they often struggled with adding two numbers.
>is there a theoretical reason why we should expect a large language model to be good at sudokus
Because LLMs have shown the ability to be good at many tasks not directly related to language, and even exhibited some crude "general intelligence" traits.
So some people would like to find out how far this can be pushed, and why it works for, e.g., a lot of tasks involving abstract manipulation of symbols and logical analysis, but not for a simple, clearly defined goal like solving an easy sudoku.
It's very hard to define what is and is not "related to language", and this is kind of a fundamental question that got a lot of attention in the 20th century. Maybe these language models can help shed some light on that.
According to OpenAI, GPT-4 scores 4 on AP Calculus BC, 5 on AP Statistics, 4 on AP Chemistry, 4 on AP Physics 2. But is mathematical/logical reasoning largely a language task? I don't really know. I feel pretty confident saying that riding a bike is not a language task, but logical reasoning, I'm not so sure.
You also have to remember that these models were trained on the study materials for all of those exams. That doesn't cheapen the achievement, except to say that it's not "emergent behavior". The model probably has something like half a billion weights dedicated to each of those exams.
LLMs are good at a lot of things we don't have a good reason to expect them to be good at. It's very hard to come up with "theoretical reasons" they should be good at things; in "theory" they should not be nearly as capable as they are. Even NLP researchers have been shocked at how well this has worked.
If there is no theory or expected result, why should anyone care what it's good at or not? You kinda get what you get, and if you don't get what you want, you do what?