Hacker News
by johnnyanmac 522 days ago
How is AI hallucinating words now? I thought that would have been the easiest thing to restrict with a sufficient dictionary.

Or maybe it's an ancient dictionary. I was kind of surprised at the sizes of dictionaries I could find while trying to test out a personal project.

1 comments

IIUC the input to LLMs is tokenized not on word boundaries but on some kind of sub-word (roughly inter-syllable) boundaries, so that whatever the model associates with "task" also applies to "tasking", "tasked", "taskmaster", etc. So a model making up compounds that don't exist is entirely possible and even desirable, especially since real humans do it with English all the time.
They’re called “lemmas”.
The intent is the same, but as I understand it LLMs don't tokenize based on lemmas, though some of the tokens probably line up with them.
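
To make the sub-word point concrete, here is a minimal sketch of a greedy longest-match splitter, loosely in the spirit of WordPiece. It is a toy: the vocabulary and the `tokenize` helper are made up for illustration, and real LLM tokenizers learn much larger vocabularies from data (often with byte-pair encoding), so actual splits will differ.

```python
# Toy vocabulary, invented for illustration; real vocabularies are learned from data.
VOCAB = {"task", "ing", "ed", "master", "s"}


def tokenize(word, vocab=VOCAB):
    """Greedy longest-match split of `word` into vocabulary pieces.

    Characters not covered by any piece fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


for w in ["task", "tasking", "tasked", "taskmaster", "taskmastering"]:
    print(w, tokenize(w))
```

"taskmastering" is not in any dictionary, yet it is built entirely from known pieces, which is roughly how a model can emit a plausible-looking made-up word. Note also that pieces like "ing" or "ed" are not lemmas; they just happen to sit next to one here.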