by staticshock
72 days ago
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
|
Try it for yourself: under the most popular tokenizer vocabulary (https://tiktokenizer.vercel.app/?model=cl100k_base), "strawberry" becomes [str][aw][berry]. Or, from the model's perspective, [496, 675, 15717]. The model doesn't know anything more about how those numbers correspond to letters than you do! It never gets sat down and told "[15717] <=> [b][e][r][r][y]", with single-byte tokens on the right. (In fact, these single-byte tokens appear in the training data extremely rarely, so the model rarely learns to do anything with them.)
Note that LLMs can predictably count the number of r's in "s t r a w b e r r y", because <Count the number of r's in "s t r a w b e r r y"> becomes [Count][ the][ number][ of][ r]['s][ in][ "][s][ t][ r][ a][ w][ b][ e][ r][ r][ y]["]. And that's just a matching problem — [ r] tokens for [ r] tokens, no token-correspondence-mapping needed.
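To make the contrast concrete, here is a rough Python sketch. It is a toy, not the real tokenizer: TOY_VOCAB is a hand-written three-entry table that reuses only the IDs quoted above, toy_encode is a hypothetical greedy longest-match helper (real BPE applies learned merge rules instead), and the spaced-out token list is copied from the split shown above rather than computed.

```python
# Toy vocabulary: only the three pieces and IDs quoted in the comment above.
# This is NOT the real cl100k_base table, just enough to show the idea.
TOY_VOCAB = {"str": 496, "aw": 675, "berry": 15717}


def toy_encode(text, vocab):
    """Greedy longest-match encoding over a toy vocabulary.

    Assumption for illustration only: real BPE tokenizers use learned
    merge rules, not longest-match lookup.
    """
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


# What the model receives for "strawberry": three opaque integers.
ids = toy_encode("strawberry", TOY_VOCAB)
print(ids)

# Nothing in those integers says how many r's are inside each piece.
# The spelling lives only in the vocabulary table, which the model never reads.
r_per_piece = {piece: piece.count("r") for piece in TOY_VOCAB}
print(r_per_piece)

# The spaced-out version, using the token split given in the comment above.
spaced_tokens = ["s", " t", " r", " a", " w", " b", " e", " r", " r", " y"]

# Here every letter is its own token, so counting is plain token matching.
r_count = spaced_tokens.count(" r")
print(r_count)
```

In the first case the count of r's is only recoverable by looking pieces up in the vocabulary table; in the second it falls out of comparing tokens to each other. If you have the tiktoken package installed, tiktoken.get_encoding("cl100k_base").encode("strawberry") should let you check the real split yourself.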