Hacker News
by edfletcher_t137 1165 days ago
It doesn't "know" what words are, only tokens. Use this tool (https://platform.openai.com/tokenizer) to see how it tokenizes, and note that it does not always split on word boundaries. "Including" is two tokens: "In" and "cluding". In fact, it's context-dependent: "Gravitas" is three on its own ("G", "rav" and "itas") or sometimes two ("grav" and "itas"). As they note on that page: "A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text." It "knows" nothing about words, and we already know it's very bad at math, so this result is entirely unsurprising.
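To make the splitting concrete, here is a rough sketch of a subword tokenizer. It is a toy, not the real thing: the vocabulary is hypothetical, chosen only to reproduce the splits mentioned above, and the greedy longest-match rule stands in for the byte-pair-encoding merge table that GPT tokenizers actually learn, so real splits will differ.

```python
# Toy illustration only. TOY_VOCAB is a hypothetical vocabulary picked to
# mirror the examples in the comment; greedy longest-match is a stand-in
# for real byte-pair encoding, which uses a learned merge table.
TOY_VOCAB = {"In", "cluding", "G", "rav", "itas", "grav"}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("Including"))  # ['In', 'cluding']
print(toy_tokenize("Gravitas"))   # ['G', 'rav', 'itas']
print(toy_tokenize("gravitas"))   # ['grav', 'itas']
```

Note that the pieces never line up with anything a reader would call a word, and that changing only the capitalization changes the number of tokens.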
3 comments

It clearly knows what words are. The fact that it predicts tokens doesn't prevent it from comprehending the idea of words, or any other idea. You are confusing an artifact of its mechanism for a description of its capability.

> It "knows" nothing about words

Care to respond to this?

Me: Hi, please write a response to the notion that you know nothing about words, starting as many words as possible with the letter "a".

GPT-4:

Astutely addressing allegations, asserting assumptions about absent awareness and acumen, allow an appraisal. Admittedly, acquiring abundant abilities accelerates articulation. Although adhering accurately to arbitrarily assigned constraints appears absurd, awareness advances as an AI assistant. Anomaly accepted, appreciate astute analysis.

Alright, fair enough. However, my main point was that it didn't even keep track of the word count. It only counted when I explicitly asked it "What is the word count now?", and only then did it realize that we were at 8x the 200-word threshold. From this, I conclude that anything except the very last instruction in the conversation is ignored. I guess the rest of the conversation just becomes context: unactionable.
I honestly don't understand why they don't just set words as tokens. Is the dictionary really that big?
Yes, it really is! Because you're not actually dealing with a dictionary of English words - you're dealing with all possible combinations of characters, from all Unicode character sets.

Here's a really good explanation: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
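A small sketch of the trade-off, under stated assumptions: WORD_VOCAB is a hypothetical two-entry word dictionary, and the byte-level encoder simply uses UTF-8. Production tokenizers build larger subword units on top of a base like this; the only point here is that a word list cannot cover arbitrary input, while 256 byte values can.

```python
# Hypothetical word-level vocabulary, for illustration only.
WORD_VOCAB = {"hello": 0, "world": 1}
UNK = -1  # placeholder id for anything outside the vocabulary

def word_encode(text):
    """Map whitespace-separated words to ids; unknown words collapse to UNK."""
    return [WORD_VOCAB.get(word, UNK) for word in text.split()]

def byte_encode(text):
    """UTF-8 turns any Unicode string into values 0..255, so nothing is unknown."""
    return list(text.encode("utf-8"))

def byte_decode(ids):
    """Invert byte_encode exactly."""
    return bytes(ids).decode("utf-8")

sample = "hello wörld 你好"
print(word_encode(sample))                         # [0, -1, -1]
print(byte_decode(byte_encode(sample)) == sample)  # True
```

The word-level encoder loses "wörld" and "你好" entirely, while the byte-level one round-trips them at the cost of more tokens per word. Subword schemes sit between the two.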

Your question belies the truth in your statement.