| There's been a fair bit of research over the last year on this topic. An easy read on a Harvard/MIT study: https://thegradient.pub/othello/ A follow-up on the more technical aspects of what's going on with it: https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-o... Two more studies since then showing linear representations of world models: https://arxiv.org/abs/2310.02207 (modeling space and time) https://arxiv.org/abs/2310.06824 (modeling truth vs falsehood) It's worth keeping in mind that these are all on smaller toy models compared to something like GPT-4, so there are likely more complex versions of a similar thing going on there; we just don't know to what extent, as it's a black box. Part of the problem with evaluating the models based on their responses is that the responses reflect both surface statistics/correlations and deeper processing, and often the former can obscure the latter. For example, in the first few weeks after release, commentators on here were pointing out that GPT-4 failed at variations of the wolf, goat, and cabbage problem. And indeed, given a version with a vegetarian wolf and a carnivorous goat, it would still go to the classic answer of taking the goat first. But if you asked it to always repeat adjectives and nouns from the original problem together and to change the nouns to the corresponding emojis (wolf, goat, cabbage), it got it right every single time on the first try. So it did have the capacity to reason out variations of the problem; you just needed to bust the bias towards surface statistics around the tokens first. |
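For anyone who wants to try the emoji trick themselves, here is a minimal sketch of how the prompt could be rewritten before sending it to a model. It is not the exact prompt I used: the emoji glyphs, the instruction wording, and the `debias_prompt` helper are all assumptions made up for illustration, and the sketch only builds the prompt text without calling any model.

```python
import re

# Hypothetical emoji choices; the exact glyphs used originally are an assumption.
EMOJI = {"wolf": "🐺", "goat": "🐐", "cabbage": "🥬"}

# Hypothetical instruction wording, paraphrasing "always repeat adjectives
# and nouns from the original problem together".
INSTRUCTION = (
    "When reasoning, always repeat each noun together with the adjective "
    "attached to it in the problem statement."
)


def debias_prompt(problem: str, emoji: dict[str, str] = EMOJI) -> str:
    """Replace the familiar nouns with emojis and prepend the instruction."""
    # Match any of the mapped nouns as whole words, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, emoji)) + r")\b", re.IGNORECASE
    )
    swapped = pattern.sub(lambda m: emoji[m.group(1).lower()], problem)
    return f"{INSTRUCTION}\n\n{swapped}"


problem = (
    "A farmer must ferry a vegetarian wolf, a carnivorous goat, and a cabbage "
    "across a river, one item at a time. What should go first?"
)
print(debias_prompt(problem))
```

The point of the swap is that the adjectives ("vegetarian", "carnivorous") stay attached to unfamiliar tokens, so the model has less of the memorized puzzle text to pattern-match against.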