by Uehreka
624 days ago
GP is referring to the fact that if it becomes well known that LLM version X can’t solve problem Q, then the model’s trainers will make sure to include problem Q prominently in the training set, running it through over and over to ensure that version X+1 is able to solve the problem whether the model’s “reasoning” abilities have improved or not. Thus observers of the LLM space like us need to keep finding novel “bellwether problems” that we think will evaluate a model’s ability to reason, knowing that once we start talking about one openly it will no longer be a useful bellwether. By their nature as “weird-shaped” problems, these aren’t the kind of thing we’re guaranteed to have an infinite supply of. As the generations move on, it will become more and more difficult to distinguish “actual improvements in reasoning” from “the model essentially has the solution to your particular riddle hard-coded”.
Though I will say that LLMs don't appear to be doing any better at the river crossing puzzles. They tend to "patch" the ones I and others actively tweet about, but they still aren't getting better at generalizing. I've taken this as fairly strong evidence that we're going in the wrong direction for reasoning (as opposed to a similar direction). But the strongest evidence to me is that they're entropy minimizers.
What's extra interesting is that transformers CRAVE augmentations. I work in vision, and augmentation is necessary to get them to do well. You can actually get much smaller models to do what bigger models can if you get this right.
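For anyone who hasn't worked in vision, here's a rough sketch of what I mean by augmentations: you show the model randomly perturbed copies of each training image instead of the same pixels every epoch. This is a toy, stdlib-only illustration and not a real training pipeline. The `augment` function, its `crop` parameter, and the tiny 4x4 "image" are all made up for the example.

```python
import random

def augment(image, crop=2, rng=random):
    """Return a randomly flipped and cropped copy of a 2D image (list of rows).

    Hypothetical helper for illustration only; real pipelines use library
    transforms and many more augmentation types.
    """
    h, w = len(image), len(image[0])
    # Random horizontal flip with probability 0.5 (builds new rows, no mutation).
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
    # Random crop: pick a top-left corner, keep an (h - crop) x (w - crop) window.
    top = rng.randint(0, crop)
    left = rng.randint(0, crop)
    return [row[left:left + w - crop] for row in image[top:top + h - crop]]

# A toy 4x4 "image" whose pixel values are just their flat indices.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Each seed yields a different view of the same underlying image.
views = [augment(image, rng=random.Random(seed)) for seed in range(3)]
```

The point is that the label stays the same while the input changes, so the model is pushed to learn what is invariant rather than memorize exact pixels.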