| I confess to the same mirror-image issue. I cannot understand why people insist that regressing in a latent space, derived from the mere associative structure of a dataset, ought to be given some noble status. It is not a model of our intelligence. It's a stupid thing. Go and learn about animal intelligence: merging template cases of what's gone before, as recorded in human social detritus, doesn't even bear mentioning. The latent space of all the text tokens on the internet is not a model of the world, and finding a midpoint is just a trick. It's a merging of "stuff we find meaningful over here" and "stuff we find meaningful over there" to produce "stuff we find meaningful" -- without ever having to know what any of it meant. The trick is that we're the audience, so we'll find the output meaningful regardless. Image generators don't "struggle with hands"; they "struggle" with everything -- it is we, the observers, who care more about the fidelity of hands. The process of generating pixels is uniformly dumb. I don't see anything more here than "this is the thing that I know!" therefore "this is a model of intelligence!11.11!01!!". It's a very, very bad model of intelligence. The datasets involved are egregious proxy measures of the world, and their distribution has little to do with it: novels, books, PDFs, etc. This is very far away from the toddler who learns to walk, learns to write, and writes what they are thinking about. They write about their day, say -- not because they "interpolate" between all books ever written, but because they have an interior representational life which is directly caused by their environment and can be communicated. Patterns in our communication are not models of this process. They're a dumb light show. |