| HN Mirror


	by frabcus 932 days ago
	I see this slightly the other way round - the difficulties caused by tokenisation are why it is good at segmentation. Words break and jump around due to it, and more so with typos in the vast amounts of training data. Also regarding backtracking... It sees all the input at once, so not sure why it needs to backtrack?

4 comments

gmadsen 932 days ago

It's referring to the search space of valid segmentations. Set up as a classical problem, this would be some type of DP with backtracking from dead-end paths. The full input is known in both cases; it's just that GPT's architecture doesn't need to search any segmentation space. Its billions of parameters approximate the function needed to arrive at the correct answer.
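
For reference, a minimal sketch of the classical approach described above: a backtracking search over split points, with dead-end positions memoized so they are never explored twice. The vocabulary `WORDS` and the function name `segment` are made up for illustration; a real dictionary would be far larger.

```python
from functools import lru_cache

# Hypothetical toy vocabulary, assumed only for this example.
WORDS = {"the", "these", "sea", "a", "of", "tranquility"}

def segment(text, words=frozenset(WORDS)):
    """Return one valid segmentation of text, or None if there is none."""
    max_len = max(map(len, words))

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end means every character was covered by a word.
        if start == len(text):
            return ()
        # Try each candidate word that begins at this position, shortest first.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in words:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        # Dead end: nothing starting here leads to a full segmentation.
        # lru_cache remembers the failure, so we backtrack only once.
        return None

    result = solve(0)
    return None if result is None else list(result)
```

Note that this only answers "is there some valid split", and returns the first one it finds. It has no notion of which split is more plausible.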

umanwizard 932 days ago

You wouldn’t even be able to solve this in the standard leetcode DP problem way, because it’s ambiguous if all you know is which words are valid. For example THESEA could be either “THE SEA” or “THESE A”. You need to have a model of English grammar to realize that the former is much more likely to be part of a valid sentence than the latter.
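
To make the ambiguity concrete, here is a small sketch that enumerates every dictionary-valid split instead of stopping at the first. The four-word vocabulary is an assumption chosen just to reproduce the THESEA example.

```python
def all_segmentations(text, words):
    """Return every way to split text into words from the given set."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Combine this first word with every split of the remainder.
            for tail in all_segmentations(text[end:], words):
                results.append([head] + tail)
    return results

words = {"THE", "THESE", "SEA", "A"}  # toy vocabulary, assumed for the example
for split in all_segmentations("THESEA", words):
    print(" ".join(split))
```

With this vocabulary it prints both "THE SEA" and "THESE A". A dictionary alone cannot choose between them.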

eutectic 932 days ago

I think even a bigram model would provide enough information.
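
A rough sketch of how that could work: score each candidate split by summing the log probability of every word given the one before it, and keep the best. The probabilities below are invented purely for illustration, not measured from any corpus, and `UNSEEN_P` is an assumed floor for missing pairs. Whether a real bigram model is enough in general is the conjecture here, not something this toy demonstrates.

```python
import math

# Made-up bigram probabilities P(word | previous word), illustration only.
# "<s>" marks the start of the text.
BIGRAM_P = {
    ("<s>", "THE"): 0.06,
    ("<s>", "THESE"): 0.005,
    ("THE", "SEA"): 0.001,
    ("THESE", "A"): 0.00001,
}
UNSEEN_P = 1e-8  # assumed floor for pairs missing from the table

def log_score(split):
    """Sum of log P(word | previous word) over a candidate split."""
    total = 0.0
    previous = "<s>"
    for word in split:
        total += math.log(BIGRAM_P.get((previous, word), UNSEEN_P))
        previous = word
    return total

candidates = [["THE", "SEA"], ["THESE", "A"]]
best = max(candidates, key=log_score)
print(" ".join(best))
```

Under these assumed numbers "THE SEA" wins, because "THESE" followed by "A" is given a much lower probability than "THE" followed by "SEA".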

Y_Y 932 days ago

Is that "big-ram" or "bi-gram"?

waveBidder 932 days ago

It's big RAM now! (A bigram model gives the probability of a word given the previous one.)

somebodythere 932 days ago

Bi-gram aka pairs of words
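
As a quick illustration, counting adjacent word pairs is all it takes to estimate bigram probabilities: P(b | a) is the count of the pair (a, b) divided by the number of times a appears as the first word of a pair. The nine-word "corpus" below is made up for the example, and the estimate is unsmoothed.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(b | a) for each adjacent pair (a, b) by relative frequency."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it starts a pair (the last token starts none).
    first_counts = Counter(tokens[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Tiny invented corpus, assumed only for this example.
tokens = "the sea of tranquility is the sea of calm".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "sea")], probs[("of", "tranquility")])
```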

esafak 932 days ago

I think it was a rhetorical question.

sp332 932 days ago

If you put it into the tokenizer https://platform.openai.com/tokenizer you can see that it helps in some places but not in others. It pulled out "SEA/OF/TRAN/QU/ILITY", but I think it broke up every instance of the word "THE".

dilawar 932 days ago

I used a funny poem I read in an obscure book. GPT may have seen it.

Lordgivemeplentymybellyisemptysixinchesbelowthetablelorsbepraisedmybellyisraisedsixinchesabovethetable

ChatGPT 3.5 segments it perfectly:

"Lord give me plenty, My belly is empty, Six inches below the table. Lord be praised, My belly is raised, Six inches above the table."

Metacelsus 932 days ago

>lorsbepraised

is not "lord be praised"

waveBidder 932 days ago

also spellchecking

Der_Einzige 932 days ago

Well that doesn’t apply to math, where LLMs are still garbage due to subpar tokenization.

GPT-4 also still fails at multiple syntactic or phonetic constraints at once, due to its tokenization scheme.
