Hacker News
by miket 1271 days ago
Any question about the letters of a word is bound to underwhelm because GPT-3 is trained on sub-word tokens, so it does not have random access to individual letters. The word "prime" is tokenized as a single token, instead of 5 characters:

https://beta.openai.com/tokenizer
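To make the token-versus-letter point concrete, here is a minimal sketch of a toy tokenizer. It is an assumption-heavy illustration: the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up, and it uses greedy longest-match splitting rather than the byte-pair encoding that GPT models actually use. It only shows the general effect: a known word comes out as one token, while an unknown string falls apart into smaller pieces.

```python
# Hypothetical vocabulary, invented for illustration; NOT GPT-3's real vocabulary.
TOY_VOCAB = {"prime", "count", "letters", "sm", "org", "le"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split (a stand-in for BPE, not the real algorithm).

    Characters that start no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("prime"))     # ['prime']
print(toy_tokenize("smorgle"))   # ['sm', 'org', 'le']
print(toy_tokenize("mlaizapf"))  # eight single-character tokens
```

With this toy vocabulary the model-side view of "prime" is one opaque symbol, so the number 5 is not visible anywhere in its input.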

I’ve seen this said before, but it’s quite clear from the responses that it understands that prime has 5 letters.
Now try a gibberish word. A human could count the number of letters but ChatGPT can only approximate it.
Me: How many letters are in the word “smorgle” ?

GPT: There is no word "smorgle" in the English language.

Me: I never said it was English.

GPT: I apologize for assuming that "smorgle" was a word in the English language. Without any further context or information about the word "smorgle," I am unable to determine how many letters it has or provide any additional information about it.

Me: Count the letters

GPT: If you would like me to count the letters in the word "smorgle," it has 7 letters.

I’m so used to double-checking anything it says that I quickly confirmed that smorgle does indeed have 7 letters.
Me: I’ve made up a word called “mlaizapf”. Can you count the number of letters?

ChatGPT: Sure! "Mlaizapf" has 8 letters.

--

Me: How about this word: “spoofydoofaplixader”

ChatGPT: "Spoofydoofaplixader" has 18 letters.

--

Me: How about the following words: "ghottiaopepinopepotorzey", "zazzamataz", and "maaq"

ChatGPT: Here are the number of letters in each word:

* "Ghottiaopepinopepotorzey" has 23 letters.

* "Zazzamataz" has 10 letters.

* "Maaq" has 4 letters.

Is there anything else I can help with?

------

It got all of them correct except for two: "spoofydoofaplixader" has 19 letters, not 18, and "ghottiaopepinopepotorzey" has 24 letters, not 23.
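The counts quoted in the transcripts in this thread can be checked mechanically. The sketch below compares each claimed count with `len()` of the word; the dictionary name `claimed` is just a label chosen here, and the numbers in it are copied from the ChatGPT replies above.

```python
# Letter counts as claimed by ChatGPT in the transcripts in this thread.
claimed = {
    "smorgle": 7,
    "mlaizapf": 8,
    "spoofydoofaplixader": 18,
    "ghottiaopepinopepotorzey": 23,
    "zazzamataz": 10,
    "maaq": 4,
}

# Keep only the words whose real length differs from the claimed count.
mismatches = {word: len(word) for word, n in claimed.items() if len(word) != n}

for word, n in claimed.items():
    status = "ok" if len(word) == n else f"wrong, actually {len(word)}"
    print(f"{word}: claimed {n} ({status})")

print(mismatches)
# {'spoofydoofaplixader': 19, 'ghottiaopepinopepotorzey': 24}
```

Both misses are on the longest words, and both are off by one.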

Very interesting... It seems similar to its math abilities, where it struggles with bigger numbers or more complex problems.

I asked it a bunch of gibberish words and it got them all correct.
My mental model is that if you give it real words, it uses approximately one token per word, and it may or may not know how many letters are in the word - it will have learned how many letters there are only if that information was in its training. Just like any other fact it learns about words. It is not counting the letters.

If you give it a gibberish word, it will represent it with one letter per token and actually be able to more or less count tokens in order to figure out how many letters there are.

So this ends up looking like it can count letters in most words, real and fake. Perhaps it would do poorly with real but uncommon words.
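A rough sketch of that mental model, under loud assumptions: the tokenizations in `EXAMPLE_TOKENS` are hypothetical and hand-written for illustration (a real tokenizer will split these words differently), and `token_count_estimate` is a made-up helper, not anything the model literally runs. It shows when "counting tokens" matches the letter count and when it does not.

```python
# Hypothetical tokenizations, invented for illustration only.
EXAMPLE_TOKENS = {
    "prime": ["prime"],                              # real word: one token
    "smorgle": ["s", "m", "o", "r", "g", "l", "e"],  # gibberish: one letter per token
    "zazzamataz": ["z", "azz", "am", "at", "az"],    # gibberish: multi-letter pieces
}


def token_count_estimate(word):
    """Estimate the letter count by counting tokens, as the mental model suggests."""
    return len(EXAMPLE_TOKENS[word])


for word, pieces in EXAMPLE_TOKENS.items():
    # Sanity check: the pieces must spell the original word.
    assert "".join(pieces) == word
    print(f"{word}: {len(word)} letters, {token_count_estimate(word)} tokens")
```

Under these assumed splits, token counting is exact only for "smorgle". For "prime" the count has to come from memorized facts, and for "zazzamataz" the token count is only a rough approximation of the letter count, since the pieces are longer than one letter.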

>more or less count tokens

Which is what I meant by "approximate": it can "count" the number of tokens.

> it does not have random access to individual letters

This presumes it works by understanding the components of the question and reasoning about them. But it doesn't work at that level; it just guesses the most likely next word based on statistical tricks. So it doesn't need to "know" about letters to generate a reasonable response involving letters.

What do you think hidden layers do?
Not familiar with that. What are they?