| HN Mirror

by moozilla 883 days ago

ChatGPT is specifically bad at these kinds of tasks because of tokenization. If you plug your query into https://platform.openai.com/tokenizer, you can see that "egregious" is a single token, so the LLM doesn't actually see any "e" characters -- to answer your question it would have had to learn a fact about how the word was spelled from its training data, and I imagine texts explicitly talking about how words are spelled are not very common.

Good explanation here if this still doesn't make sense: https://twitter.com/npew/status/1525900849888866307, or check out Andrej Karpathy's latest video if you have 2 hours for a deep dive: https://www.youtube.com/watch?v=zduSFxRajkE

IMO questions about spelling or number sense are pretty tired as gotchas, because they are all basically just artifacts of this implementation detail. There are other language models available that don't have this issue. BTW this is also the reason DALL-E etc suck at generating text in images.
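To make the tokenization point concrete, here is a minimal sketch of what "the LLM doesn't see any e characters" means. Everything in it is an assumption for illustration: the vocabulary, the IDs, and the greedy longest-match `encode` helper are made up and are much simpler than a real BPE tokenizer.

```python
# Toy illustration only: this vocabulary and its IDs are invented,
# not taken from any real tokenizer.
TOY_VOCAB = {"How": 0, " many": 1, " e": 2, "'s": 3, " in": 4, " egregious": 5, "?": 6}


def encode(text, vocab):
    """Greedy longest-match encoding (a simplification of real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


ids = encode("How many e's in egregious?", TOY_VOCAB)
print(ids)  # [0, 1, 2, 3, 4, 5, 6]
```

The model only ever receives the list of integers. In this toy setup the whole word is the single ID 5, and nothing about that number says how many "e"s the word contains.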

2 comments

Izkata 883 days ago

> If you plug your query into https://platform.openai.com/tokenizer, you can see that "egregious" is a single token

That says it's 3 tokens.

wruza 883 days ago

It doesn’t even matter how many tokens there are, because LLMs are completely ignorant about how their input is structured. They don’t see letters or syllables because they have no “eyes”. The closest analogy with a human is that vocal-ish concepts just emerge in their mind without any visual representation. They can only “recall” how many “e”s there are, but cannot look and count.

alickz 882 days ago

>They can only “recall” how many “e”s are there, but cannot look and count.

Like a blind person?

wruza 882 days ago

My initial analogy was already weak, so I guess there's no point in extending it. The key fact here is that tokens are inputs to what essentially is an overgrown matrix multiplication routine. Everything "AI" happens a few levels of scientific abstraction higher, and is semantically disconnected from the "moving parts".
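One common way to picture "tokens are inputs to matrix multiplication" is the embedding step: a token ID selects a row of a matrix, which is the same as multiplying a one-hot vector by that matrix. The sketch below assumes a made-up 3-token vocabulary with 2-dimensional embeddings; the numbers and helper names are hypothetical.

```python
# Made-up embedding matrix: 3 tokens, 2 dimensions each.
EMBEDDINGS = [
    [0.1, -0.3],
    [0.7, 0.2],
    [-0.5, 0.9],
]


def one_hot(token_id, size):
    """Vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(cols)]


token_id = 2
via_matmul = vec_mat(one_hot(token_id, len(EMBEDDINGS)), EMBEDDINGS)
via_lookup = EMBEDDINGS[token_id]
print(via_matmul == via_lookup)  # True
```

From this point on the arithmetic works on the vector, not on the characters that produced the token.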

brewtide 883 days ago

Pre-cogs, I knew it.

redox99 883 days ago

" egregious" (with a leading space) is the single token. Most lower case word tokens start with a space.

ToValueFunfetti 883 days ago

The number of tokens depends on context; if you just entered 'egregious' it will have broken it into three tokens, but with the whole query it's one.

fuzztester 883 days ago

Why three tokens, not one?

317070 883 days ago

Without the leading space, it is not common enough as a word to have become a token in its own right. As with the vast majority of lowercase words, in OpenAI's tokenizer you need to start " egregious" with a space character to get the single token.
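The leading-space effect can be sketched with a toy tokenizer. This is an assumption-laden illustration, not OpenAI's actual vocabulary: the sub-word pieces "eg", "reg", and "ious" are invented, and `tokenize` is a hypothetical greedy longest-prefix matcher rather than real BPE.

```python
# Hypothetical vocabulary: the word exists as one token only with a leading space.
TOY_VOCAB = [" egregious", "eg", "reg", "ious", " is", "This"]


def tokenize(text, vocab):
    """Repeatedly take the longest vocabulary entry that prefixes the remaining text."""
    tokens = []
    while text:
        match = max((v for v in vocab if text.startswith(v)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {text[0]!r}")
        tokens.append(match)
        text = text[len(match):]
    return tokens


alone = tokenize("egregious", TOY_VOCAB)
in_sentence = tokenize("This is egregious", TOY_VOCAB)
print(alone)        # ['eg', 'reg', 'ious']
print(in_sentence)  # ['This', ' is', ' egregious']
```

Typed on its own, the word has no leading space and falls back to three pieces; inside a sentence the preceding space is part of the token, so it matches as one.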

chgs 883 days ago

ChatGPT could say “I don’t know”
