by olooney 932 days ago
I recently discovered that GPT-4 is also good at a related task: word segmentation. For example, it can translate this:

    UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOFTRA
    NQUILITYMEETSTHEEDGEOFTWILIGHTLIESAHIDDENTROV
    EOFWISDOMFORGOTTENBYMANYCOVETEDBYTHOSEINTHEKN
    OWITHOLDSTHEKEYSTOUNTOLDPOWER
To this:

    Underneath the gaze of Orion's belt, where the Sea of Tranquility meets the
    edge of twilight, lies a hidden trove of wisdom, forgotten by many, coveted
    by those in the know. It holds the keys to untold power.
(The prompt was, "Segment and punctuate this text: {text}".)

This was interesting because word segmentation is a difficult problem that is usually thought to require something like dynamic programming[1][2] to get right. It's a little surprising that GPT-4 can handle this, because it has no capability to search different alternatives or to backtrack if it makes a mistake, but apparently its stronger understanding of language means that it doesn't really need to.
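For anyone who hasn't seen the classical approach, here is a minimal sketch of dynamic-programming segmentation. It is only an illustration and not the code from [1] or [2]: the `WORD_COSTS` table and the `segment` function are made up for this example, and the costs are invented numbers standing in for negative log-probabilities that a real segmenter would estimate from a large corpus.

```python
import math

# Toy unigram costs (lower = more likely). These numbers are invented
# for illustration; a real segmenter would estimate them from a corpus.
WORD_COSTS = {
    "the": 1.0, "sea": 3.0, "of": 1.5, "these": 3.0, "a": 1.5,
    "tranquility": 6.0, "meets": 4.0, "edge": 4.0, "twilight": 5.0,
}
MAX_WORD_LEN = max(len(w) for w in WORD_COSTS)


def segment(text):
    """Return the lowest-cost split of `text` into known words, or None."""
    text = text.lower()
    n = len(text)
    # best[i] = (cost, start index of the last word) for the cheapest
    # segmentation of text[:i]; infinite cost means "not reachable".
    best = [(math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in WORD_COSTS and best[start][0] < math.inf:
                cost = best[start][0] + WORD_COSTS[word]
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == math.inf:
        return None
    # Follow the stored start indices backwards to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


print(segment("THESEAOFTRANQUILITYMEETSTHEEDGEOFTWILIGHT"))
```

The table of partial results is what lets the classical method recover from a locally plausible but wrong split such as "these a", which is the kind of search GPT-4 cannot do explicitly.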

It's also surprising that tokenization doesn't appear to interfere with its ability to do these tasks, because it seems like it would make things a lot harder. According to the OpenAI tokenizer[3], GPT-4 sees the following tokens in the above text:

    UNDER NE AT HT HE GA Z EOF OR ION SB EL TW HER ET HE SEA OF TRA
Except for "UNDER", "SEA", and "OF", almost none of those token breaks fall at natural word boundaries. The same is true for the scrambled text examples in the original article. So GPT-4 must actually be taking those tokens apart into individual letters and gluing them back together into completely new tokens somewhere inside its many layers of transformers.

[1]: https://web.cs.wpi.edu/~cs2223/b05/HW/HW6/SolutionsHW6/

[2]: https://pypi.org/project/wordsegmentation/

[3]: https://platform.openai.com/tokenizer

15 comments

Interesting: for the hard example in [2], it outperforms the project, though I credit this to memorization (given that it is able to reproduce the correct stanza and punctuation for "Spring and Fall, to a Young Child").

FWIW, the only reason you need DP to get it "right" is because, well, you want it right. A human can of course generally split words with just a language model in one pass, as long as the text isn't ambiguous. And on the flip side, you absolutely need a language model to correctly segment text. "ilovesnails" can only be decoded correctly if you understand subject-verb agreement, given that two solutions pass a dictionary check: "I love snails" and "I loves nails".
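To make the ambiguity concrete, here is a small sketch that enumerates every dictionary-valid split of "ilovesnails". The `WORDS` set and the `all_segmentations` helper are hypothetical and exist only for this example; the point is that a dictionary alone cannot choose between the two results.

```python
# Hypothetical mini-dictionary, just large enough for this example.
WORDS = {"i", "love", "loves", "snails", "nails"}


def all_segmentations(text, words=WORDS):
    """Return every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in words:
            # Recurse on the remainder and prepend the matched word.
            for tail in all_segmentations(text[i:], words):
                results.append([head] + tail)
    return results


for seg in all_segmentations("ilovesnails"):
    print(" ".join(seg))
# i love snails
# i loves nails
```

Both lines pass the dictionary check, so something that knows about agreement (or at least word-pair statistics) has to break the tie.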

FWIW, GPT-4 Turbo is imperfect.

> Heenjoysgoingtotheparkswimmingdancingandlovesnails

produces

> He enjoys going to the parks, swimming, dancing, and loves snails.

Note how it added an additional "s", presumably because "snails" is a much more probable object of "love" than "nails" is (no idea why "park" also became "parks"). I found it hard to guide it to the correct solution without explicit prompting.

Amusingly, even with guiding, it first broke its own grammar model, choosing:

> He enjoys going to the park, swimming, dancing, and love snails.

There's an extra 's' on 'park' in the initial attempt as well, presumably the one from 'swimming'.
Try other languages. I literally just went to the German Wikipedia and copied that random "article of the day". So try this and compare:

    JEHUDABENISAAKABRAVANELUM1460INLISSABONNACH1521INNEAPELWAREINJÜDISCHERPHILOSOPHARZTUNDDICHTERAUSPORTUGALDERDIEZWEITEHÄLFTESEINESLEBENSINITALIENVERBRACHTEDENNAMENLEOBZWLEONEMITDEMERINDERMODERNENLITERATURMEISTBEZEICHNETWIRDLIEGERSTERSERALSERWACHSENERANABRAVANELGEHÖRTEZUDENPROMINENTENVERTRETERNDIESPLATONISMUSINDIERENAISSANCEERVERFÜGTEÜBEREINEHERVORRAGENDEHUMANISTISCHEBILDUNGUNDKENNTESSICHINDERCHRISTLICHENEBENSOWIEINDERJÜDISCHENUNDISLAMISCHENPHILOSOPHISCHENTRADITIONAUSSEINHAUPTWERKDIEDIALOGHIDAMOREDIALOGEÜBERDIELIEBEKNÜPFTANPLATONSKONZEPTDESLITERARISCHKUNSTVOLLGESTALTETENPHILOSOPHISCHENDIALOGENANWIEBEIPLATONUNDINDERTRADITIONELLENBELEHRENDEDIALOGLITERATURBEMÜHENSICHINDENDIALOGHIDAMOREDIEGESPRÄCHSPARTNERGEMEINSAMUMERKENNTNISABRAVANELWANDELTDGEGENDSHERKÖMMONZEPTABINDEMERANDIEDSTELLEDESÜBLICHENLEHRERSCHÜLERVERHÄLTNISSESDIALOGTEILNEHMEREINGEDANKENAUSTAUSCHUNDZUGLEICHGEISTIGENKAMPFZWEIEREBEBÜRTIGERGEPRÄCHSPARTNERSETZT


    
Jehuda ben Isaak Abravanel (* um 1460 in Lissa­bon; † nach 1521 in Neapel) war ein jüdi­scher Philo­soph, Arzt und Dichter aus Portu­gal, der die zweite Hälfte seines Lebens in Italien verbrachte. Den Namen Leo bzw. Leone, mit dem er in der moder­nen Literatur meist bezeich­net wird, legte er sich erst als Erwach­sener an. Abra­vanel gehörte zu den promi­nenten Vertre­tern des Platonis­mus in der Renais­sance. Er verfügte über eine hervor­ragende huma­nisti­sche Bil­dung und kannte sich in der christ­lichen ebenso wie in der jüdi­schen und der islami­schen philo­sophi­schen Tradi­tion aus. Sein Haupt­werk, die Dialoghi d’amore („Dialoge über die Liebe“), knüpft an Platons Konzept des litera­risch kunst­voll gestal­teten philo­sophi­schen Dialogs an. Wie bei Platon und in der tradi­tionel­len belehren­den Dialog­litera­tur bemühen sich in den Dialoghi d’amore die Ge­sprächs­part­ner gemein­sam um Erkennt­nis. Abra­vanel wandelt jedoch das her­kömm­liche Konzept ab, indem er an die Stelle des übli­chen Lehrer-Schüler-Ver­hältnis­ses der Dialog­teilneh­mer einen Gedanken­aus­tausch und zugleich geisti­gen Kampf zweier eben­bürtiger Ge­sprächs­partner setzt.

Hint: It doesn't fare quite as well ;)

I see this slightly the other way round - the difficulties caused by tokenisation are why it is good at segmentation. Words break and jump around due to it, and more so with typos in the vast amounts of training data.

Also regarding backtracking... It sees all the input at once, so not sure why it needs to backtrack?

It's referring to the search space of valid segmentations. Set up as a classical problem, it would be some type of DP with backtracking from dead-end paths. The full input is known in both cases; it's just that GPT's architecture doesn't need to search any segmentation space. Its billions of parameters approximate the function needed to arrive at the correct answer.
You wouldn’t even be able to solve this in the standard leetcode DP problem way, because it’s ambiguous if all you know is which words are valid. For example THESEA could be either “THE SEA” or “THESE A”. You need to have a model of English grammar to realize that the former is much more likely to be part of a valid sentence than the latter.
I think even a bigram model would provide enough information.
Is that "big-ram" or "bi-gram"?
It's big RAM now! (A bigram model gives the probability of a word given the previous word.)
Bi-gram aka pairs of words
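As a rough illustration of the bigram suggestion, the sketch below scores the two readings of THESEA with an add-one-smoothed bigram model, which conditions each word on the one before it. Every number in it is invented for the example (the counts, the vocabulary size), and `bigram_log_prob` is a hypothetical helper, so treat it as a demonstration of the mechanism and not as real English statistics.

```python
import math

# Invented bigram counts for illustration only; "<s>" marks the start of the text.
BIGRAM_COUNTS = {
    ("<s>", "the"): 50, ("the", "sea"): 20,
    ("<s>", "these"): 10, ("these", "a"): 0,
}
# Invented counts of how often each word appears as the first half of a bigram.
CONTEXT_COUNTS = {"<s>": 100, "the": 200, "these": 40}
VOCAB_SIZE = 1000  # assumed vocabulary size, used for add-one smoothing


def bigram_log_prob(words):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    total = 0.0
    prev = "<s>"
    for word in words:
        count = BIGRAM_COUNTS.get((prev, word), 0)
        context = CONTEXT_COUNTS.get(prev, 0)
        total += math.log((count + 1) / (context + VOCAB_SIZE))
        prev = word
    return total


candidates = [["the", "sea"], ["these", "a"]]
best = max(candidates, key=bigram_log_prob)
print(" ".join(best))  # the sea
```

With real counts the gap would come from the same place: "these a" is a word pair that almost never occurs, so its score collapses even though both words are common on their own.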
If you put it into the tokenizer https://platform.openai.com/tokenizer you can see that it helps in some places but not in others. It pulled out "SEA/OF/TRAN/QU/ILITY", but I think it broke up every instance of the word "THE".
I used a funny poem I read in an obscure book. GPT may have seen it.

Lordgivemeplentymybellyisemptysixinchesbelowthetablelorsbepraisedmybellyisraisedsixinchesabovethetable

ChatGPT 3.5 segments it perfectly

"Lord give me plenty, My belly is empty, Six inches below the table. Lord be praised, My belly is raised, Six inches above the table."

>lorsbepraised

is not "lord be praised"

also spellchecking
Well that doesn’t apply to math, where LLMs are still garbage due to subpar tokenization.

GPT-4 also still fails at multiple syntactic or phonetic constraints at once, due to its tokenization scheme.

Just curious, have you tried a more uncommon paragraph? Could it be the case that the model simply learned the poem due to it being in the training set?
That fragment isn't any poem I know of. Google shows one result for an exact search of the opening, which is this post.
I found more of it in here: https://www.oranlooney.com/post/playfair/

However, I can't determine where this is originally from...

it's… from men in black iirc, so if it knows movie scripts, it'd know this.
Here is the result of asking for word segmentation with the text of your comment and the text of this comment, minus the link.

https://chat.openai.com/share/b17ecc0b-570c-4e20-9556-23bfa1...

Seems easy enough to do a more rigorous test. Just find a large set of novel text, write a program to segment it by sentence as well as uppercasing and removing spaces/punctuation.

Then run it through the GPT-4 API and compare the output to the original.
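A minimal sketch of that test harness might look like the following. All function names here are made up for this example, the sentence splitter is deliberately naive, and the model call is left as a `restore` callable that you would supply yourself (for instance a wrapper around whatever API you use); no real API is assumed.

```python
import re


def scramble(text):
    """Uppercase the text and drop spaces, punctuation and other non-alphanumerics."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_accuracy(original, restored):
    """Fraction of original words reproduced at the same position, ignoring case and punctuation."""
    def words(s):
        s = s.lower().replace("'", "").replace("\u2019", "")
        return re.findall(r"\w+", s)
    ref, hyp = words(original), words(restored)
    if not ref:
        return 1.0
    matches = sum(1 for a, b in zip(ref, hyp) if a == b)
    return matches / len(ref)


def evaluate(text, restore):
    """Average word accuracy of `restore` (any callable str -> str) over the sentences of `text`."""
    sentences = split_sentences(text)
    scores = [word_accuracy(s, restore(scramble(s))) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0
```

Position-by-position matching is a crude metric, since one inserted or dropped word shifts everything after it. An edit-distance-based score would be fairer for a serious comparison.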

Feeding the above scrambled text "UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOF" into the Google search box also segments and punctuates it correctly ("Showing results for" followed by the correct text). I believe this is a best-guess type of scenario based on famous literary sources.
Did you look at the search results? I didn't see anything resembling the text in any of them. I don't think this line comes from a famous literary source. Likely Google just has logic to handle missing whitespace.
Posing the same challenge to deepseek-llm:67b-chat succeeds as well (ollama run deepseek-llm:67b-chat "$(cat prompt-segment.txt)"). This is something even smaller models are pretty good at.
We can go even smaller. With Mistral 7B:

"Underneath the gaze of the forions' belt, where the sea of tranquility meets the edge of twilight, lies a hidden trove of wisdom forgotten by many, coveted by those in the know. The old still holds the key to told power."

Sure, it's not 100% correct though.

Glad I found your comment cause I just ran a test with the same model and it worked first shot. What a great model.
For what it's worth, I tried this with the self-hosted `openorcaxopenchat-preview2-13b.ggmlv3.q5_K_S.bin` and it worked _almost_ identically. It misses some of the punctuation in yours, and it botches it a bit at the end (`...know withhold...`), but otherwise...

  Underneath the gaze of Orion's Belt, where the sea of tranquility meets the edge of twilight lies a hidden trove of wisdom forgotten by many coveted by those in the know withhold the keys to untold power.
GPT-3 (ChatGPT) also succeeds at deciphering your example text. I didn't think it was that impressive until I realized the tokens were going across word boundaries like you said.
> GPT-3 (ChatGPT)

ChatGPT's lower model is GPT-3.5-turbo, it is not GPT-3.

GPT-3 was silently replaced with RLHF'd ("instruct") versions (while keeping the same model names in the Playground) because they were cheaper to run (and had better "safety").

I think this means that current GPT-3 is more similar to GPT-3.5-turbo than to old (retired) GPT-3, but I'm not entirely sure about the details.

> According to the OpenAI tokenizer[3], GPT-4 sees the following tokens in the above text:

Seems to me, this task depends heavily on the tokenizer, and I'm a bit sceptical if that is really the tokenizer's output. Isn't BPE supposed to result in the longest letter sequences that are in the dictionary?

If you assume that common words like "underneath" and "the" are in the dictionary, the "greedy" tokenization would match the actual words.
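On the BPE question: as far as I understand it, BPE encoding does not pick the longest dictionary match. It starts from single characters and applies learned merges in priority order, so an early merge that straddles a word boundary can block a longer token from ever forming. The sketch below shows the difference with an invented merge table. It is not the real GPT-4 vocabulary, and real tokenizers also pre-split the text with a regex first, which is left out here.

```python
# Invented merge table: (left, right) -> priority (lower merges first).
# This is NOT the real GPT-4 vocabulary, only a toy to show the mechanism.
MERGES = {
    ("E", "S"): 0,
    ("T", "H"): 1,
    ("TH", "E"): 2,
    ("E", "A"): 3,
    ("S", "EA"): 4,
}
# Vocabulary = single letters plus every token a merge can produce.
VOCAB = set("THESA") | {left + right for left, right in MERGES}


def bpe_encode(text):
    """Apply merges in priority order, starting from single characters."""
    tokens = list(text)
    while True:
        pairs = [
            (MERGES[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in MERGES
        ]
        if not pairs:
            return tokens
        _, i = min(pairs)  # highest-priority merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


def greedy_encode(text):
    """Repeatedly take the longest vocabulary entry matching at the current position."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in VOCAB:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens


print(bpe_encode("THESEA"))     # ['TH', 'ES', 'EA']
print(greedy_encode("THESEA"))  # ['THE', 'SEA']
```

Here "THE" and "SEA" are both in the vocabulary, yet the high-priority "ES" merge fires first and the word-aligned tokens never get a chance, which is consistent with the odd-looking splits in the parent comment.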

Interesting! This also seems to work with smaller quantised models. I just tried it with a 4-bit quantised version of WizardLM 13B v1.2 and it seems to work quite well.
GPT-3 Davinci from 2020 could do this too:

Segment this text into proper sentences:

Input:

UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOFTRANQUILITYMEETSTHEEDGEOFTWILIGHTLIESAHIDDENTROVEOFWISDOMFORGOTTENBYMANYCOVETEDBYTHOSEINTHEKNOWITHOLDSTHEKEYSTOUNTOLDPOWER

Output:

1. Underneath the gaze of Orion's belt, where the sea of tranquility meets the edge of twilight, lies a hidden trove of wisdom forgotten by many, coveted by those in the know. 2. It holds the keys to untold power.

Mistral 7B 4bit quant on an iPhone 15 Pro can also translate this. https://tinypic.host/image/AZGhb
What can explain this? Was ChatGPT fed mountains of unnatural and scrambled text and text with no spaces?

No, it wasn't. It figured it out through actual intelligence.

No, GPT just stores a dictionary of the most common letter sequences (tokens), which are not always words and are sometimes parts of words. In GPT-2 there were about 50,000 tokens - https://huggingface.co/roberta-base/raw/main/vocab.json . GPT-4 uses a vocabulary of about 100,000 tokens (according to some sources, which I can't verify). While you may find it unusual for English, for some other languages like Japanese, splitting text without spaces into tokens started many years ago. Otherwise processing text is basically impossible there (there are no spaces in Japanese texts).
The token system used by large language models like GPT-4 is designed to be comprehensive enough to represent virtually any text, including every possible word that could exist in a language. This is separate from training the neural net and is chosen deliberately.

The training process teaches LLMs how to compose these tokens to form replies to our queries. The training data used in the training process does not have obscured words or sentences with strange spacing. The LLM is still able to compose the tokens correctly from varied input that never existed in the training data.

It is intelligence.

Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table with probabilities to see tokens in specific context.

And even then ChatGPT fails to segment "policecaughttherapist" (https://chat.openai.com/share/21c7596a-6474-4639-8a92-5cea54...), even though:

1) If I talked about a therapist, the sentence would look like "police caught _the_ therapist"

2) How often do the police even catch therapists? Come on, it looks like the training set was just heavily censored. No intelligence, just a broken ngram database (where n = length of articles in training set, see https://news.ycombinator.com/item?id=38458683).

>Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table with probabilities to see tokens in specific context.

The training data is more important than the dictionary, because the dictionary is designed to be able to form every possible combination of words and sentences. It is not limited to specific words; it is building words and sentences from building blocks.

1. That parsing is valid. Though unlikely. The choice it made is not incorrect. Thus not a sign of lack of intelligence.

2. Not often. But if you ask chatGPT to reinterpret the word in another way that is grammatically correct it will find the rapist. It shows definitively there is no censorship of the word.

3. I actually didn't see the alternative myself for some reason. Therapist jumped out at me and I didn't see what you were talking about for a good couple of minutes. I mean, unless you want to think of me (a human) as not "intelligent" then clearly it's not a factor here.

I'll test this out later. This is a huge problem for us because unstructured incorrectly segments words when reading PDF files.
I'd think it's really good at going from tokens to words. That's exactly what it's trained to do.
I would not be surprised if it learned to beam search.