by comeonbro
486 days ago
Imagine if I asked you how many '⊚'s are in 'Ⰹ⧏⏃'. (The answer is 3, because there is 1 ⊚ in Ⰹ and 2 ⊚s in ⏃.)

That's a much harder question than if I asked you how many '⟕'s are in 'Ⓕ⟕⥒⟲⾵⟕⟕⢼'. (The answer is 3, because there are 3 ⟕s there.)

You'd need to read through like 100,000x more random internet text to infer that there is 1 ⊚ in Ⰹ and 2 ⊚s in ⏃ (when this is not something that people ever explicitly talk about) than you would need to figure out that there are 3 ⟕s when 3 ⟕s appear, or to figure out from context clues that Ⰹ⧏⏃s are red and edible.

The former is how tokenization makes 'strawberry' look to LLMs: https://i.imgur.com/IggjwEK.png

It's a consequence of an engineering tradeoff, not a demonstration of a fundamental limitation.
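
If it helps to see it concretely, here's a rough toy sketch of the same idea. The vocabulary, the split into "str" / "aw" / "berry", and the IDs are all made up for illustration and are not taken from any real tokenizer. The point is only that the model is handed the list of IDs, while the letter count lives in a spelling table it never gets to look at directly.

```python
# Toy vocabulary: the split and the IDs are hypothetical, chosen for
# illustration only. Real tokenizers have their own vocabularies.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


def count_letter_via_spelling(ids, letter, vocab=TOY_VOCAB):
    """Count a letter given only IDs, using the ID -> spelling table.

    This table is exactly the knowledge a model would have to infer
    indirectly from text, since it only ever sees the IDs.
    """
    spelling = {tid: piece for piece, tid in vocab.items()}
    return sum(spelling[tid].count(letter) for tid in ids)


ids = toy_tokenize("strawberry")
print(ids)                                   # what the model sees
print(ids.count(101))                        # counting a visible symbol is easy
print(count_letter_via_spelling(ids, "r"))   # counting letters needs the table
```

Counting how often ID 101 shows up is the '⟕' question. Counting the r's is the '⊚' question: you have to already know what is inside each opaque symbol.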