I recently discovered that GPT-4 is also good at a related task, word segmentation. For example, it can translate this:
UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOFTRA
NQUILITYMEETSTHEEDGEOFTWILIGHTLIESAHIDDENTROV
EOFWISDOMFORGOTTENBYMANYCOVETEDBYTHOSEINTHEKN
OWITHOLDSTHEKEYSTOUNTOLDPOWER
To this:
Underneath the gaze of Orion's belt, where the Sea of Tranquility meets the
edge of twilight, lies a hidden trove of wisdom, forgotten by many, coveted
by those in the know. It holds the keys to untold power.
(The prompt was, "Segment and punctuate this text: {text}".)
This was interesting because word segmentation is a difficult problem that is usually thought to require something like dynamic programming[1][2] to get right. It's a little surprising that GPT-4 can handle this, because it has no way to search over alternatives or backtrack if it makes a mistake, but apparently its stronger understanding of language means that it doesn't really need to. It's also surprising that tokenization doesn't appear to interfere with its ability to do these tasks, because it seems like it would make things a lot harder. According to the OpenAI tokenizer[3], GPT-4 sees the following tokens for the first line of the above text:
UNDER NE AT HT HE GA Z EOF OR ION SB EL TW HER ET HE SEA OF TRA
Except for "UNDER", "SEA", and "OF", almost none of those token breaks fall at natural word boundaries. The same is true for the scrambled text examples in the original article. So GPT-4 must actually be taking those tokens apart into individual letters and gluing them back together into completely new tokens somewhere inside its many transformer layers.
[1]: https://web.cs.wpi.edu/~cs2223/b05/HW/HW6/SolutionsHW6/
[2]: https://pypi.org/project/wordsegmentation/
[3]: https://platform.openai.com/tokenizer
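For anyone who hasn't seen the dynamic programming approach mentioned above, here is a minimal sketch of the general idea. It is not the code from either linked reference. The word counts are invented for illustration, and `COUNTS`, `word_cost`, and `segment` are hypothetical names. It picks the split whose words have the lowest total negative log probability under a unigram model.

```python
import math

# Toy unigram counts, invented for illustration (not from any real corpus).
COUNTS = {
    "underneath": 5, "under": 20, "neath": 1, "the": 500, "he": 100,
    "gaze": 5, "of": 400, "orions": 2, "or": 150, "ions": 3, "belt": 8,
    "a": 300, "at": 200,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def word_cost(word):
    """Negative log probability of a dictionary word under the toy model."""
    return -math.log(COUNTS[word] / TOTAL)


def segment(text):
    """Return the lowest-cost split of text into dictionary words, or None."""
    text = text.lower()
    n = len(text)
    # best[i] = (cost, start index of the last word) for the best split of text[:i]
    best = [(0.0, 0)] + [(math.inf, -1)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in COUNTS and best[start][0] < math.inf:
                cost = best[start][0] + word_cost(word)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == math.inf:
        return None  # no split into dictionary words exists
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("UNDERNEATHTHEGAZEOFORIONSBELT"))
```

With these made-up counts, "underneath" beats "under" + "neath" and "orions" beats "or" + "ions", because each extra word adds its own cost. A real segmenter would need a large word list and some handling for unknown words.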
FWIW, the only reason you need DP to get it "right" is because, well, you want it right. A human can of course generally split words with just a language model in one pass, as long as the text isn't ambiguous. And on the flip side, you absolutely need a language model to correctly segment text. "ilovesnails" can only be decoded correctly if you understand subject-verb agreement, given that there are two segmentations made up entirely of dictionary words: "I love snails" and "I loves nails".
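To make the ambiguity concrete, here is a small sketch that enumerates every dictionary-valid split of "ilovesnails". The five-word dictionary is a toy chosen for this example, and `all_segmentations` is a hypothetical helper, not part of any library.

```python
# Toy dictionary, chosen for illustration only.
WORDS = {"i", "love", "loves", "nails", "snails"}


def all_segmentations(text, words=WORDS):
    """Return every way to split text into words from the dictionary."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in words:
            # Keep this prefix and try every split of the remainder.
            for rest in all_segmentations(text[i:], words):
                results.append([head] + rest)
    return results


for split in all_segmentations("ilovesnails"):
    print(" ".join(split))
# prints:
# i love snails
# i loves nails
```

The dictionary alone accepts both splits. Choosing between them takes something that knows "I loves" is ungrammatical, which is exactly the language model part.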
FWIW, GPT-4 Turbo is imperfect.
> Heenjoysgoingtotheparkswimmingdancingandlovesnails
produces
> He enjoys going to the parks, swimming, dancing, and loves snails.
Note how it added an extra "s", presumably because "snails" is so much more probable than "nails" as the object of "love" (no idea why "park" also became "parks"). I found it hard to guide it to the correct solution without explicit prompting.
Amusingly, even with guiding, it first broke its own grammar model, choosing:
> He enjoys going to the park, swimming, dancing, and love snails.