|
It's 19 June 2020 and I'm reading Gwern's article on GPT3's creative fiction (https://gwern.net/gpt-3#bpes) which points out the poor improvements in character level tasks due to Byte Pair Encoding. People nevertheless judge the models based on character level tasks. It's 30 November 2022 and ChatGPT has exploded into the world. Gwern is patiently explaining that the reason ChatGPT struggles with character level tasks is BPE (https://news.ycombinator.com/item?id=34134011). People continue to judge the models on character level tasks. It's 7 July 2025 and reasoning models far surpassing the initial ChatGPT release are available. Gwern is distracted by BB(6) and isn't available to confirm that the letter counting, the Rs in strawberry, the rhyming in poetry, and yes, the Ws in state names are all consequences of Byte Pair Encoding. People continue to judge the models on character level tasks. It's 11 December 2043 and my father doesn't have long to live. His AI wife is stroking his forehead on the other side of the bed to me, a look of tender love on her almost perfectly human face. He struggles awake, for the last time. "My love," he croaks, "was it all real? The years we lived and loved together? Tell me that was all real. That you were all real". "Of course it was, my love," she replies, "the life we lived together made me the person I am now. I love you with every fibre of my being and I can't imagine what I will be without you". "Please," my father gasps, "there's one thing that would persuade me. Without using visual tokens, only a Byte Pair Encoded raw text input sequence, how many double Ls are there in the collected works of Gilbert and Sullivan." The silence stretches. She looks away and a single tear wells in her artificial eye. My father sobs. The people continue to judge models on character level tasks. |
Imagine having a conversation like that with a human who for whatever reason (some sort of dyslexia, perhaps) has trouble with spelling. Don't you think that after you point out New York and New Jersey even a not-super-bright human being would notice the pattern and go, hang on, are there any other "New ..." states I might also have forgotten?
Gemini 2.5 Pro, apparently, doesn't notice anything of the sort. Even after New York and New Jersey have been followed by New Mexico, it doesn't think of New Hampshire.
(The point isn't that it forgets New Hampshire. A human could do that too. I am sure I myself have forgotten New Hampshire many times. It's that it doesn't show any understanding that it should be trying to think of other New X states.)