Hacker News new | ask | show | jobs
by frabcus 932 days ago
I see this slightly the other way round - the difficulties caused by tokenisation are why it is good at segmentation. Words break and jump around due to it, and more so with typos in the vast amounts of training data.

Also regarding backtracking... It sees all the input at once, so not sure why it needs to backtrack?

4 comments

its referring to the search space of valid segmentations, which if set up as a classical problem, it would be some type of DP with backtracking from deadend paths. The full input is known in both cases, its just that gpts arch doesn't need to search any segmentation space, its billions of parameters aproximate the function needed to arrive at the correct answser
You wouldn’t even be able to solve this in the standard leetcode DP problem way, because it’s ambiguous if all you know is which words are valid. For example THESEA could be either “THE SEA” or “THESE A”. You need to have a model of English grammar to realize that the former is much more likely to be part of a valid sentence than the latter.
I think even a bigram model would provide enough information.
Is that "big-ram" or "bi-gram"?
it's big RAM now! (a bi-gram is the probability of a word given the previous 2.
Bi-gram aka pairs of words
I think it was a rhetorical question.
If you put it into the tokenizer https://platform.openai.com/tokenizer you can see that it helps in some places but not in others. It pulled out "SEA/OF/TRAN/QU/ILITY", but I think it broke up every instance of the word "THE".
I used a funny poem I read in a obscure book. Gpt may have sent it.

Lordgivemeplentymybellyisemptysixinchesbelowthetablelorsbepraisedmybellyisraisedsixinchesabovethetable

ChatGPT 3.5 segments it perfectly

"Lord give me plenty, My belly is empty, Six inches below the table. Lord be praised, My belly is raised, Six inches above the table."

>lorsbepraised

is not "lord be praised"

also spellchecking
Well that doesn’t apply to math, where LLMs are still garbage due to subpar tokenization.

GPT-4 also still fails at multiple syntactic or phonetic constraints at once, due to its tokenization scheme.