| HN Mirror


	by johnnyanmac 522 days ago
	How is AI hallucinating words now? I thought that would have been the easiest thing to restrict with a sufficient dictionary. Or maybe it's an ancient dictionary. I was kind of surprised at the sizes of dictionaries I could find while trying to test out a personal project.

1 comments

Arnavion 522 days ago

IIUC the input to LLMs is tokenized not on word boundaries but on some kind of sub-word boundaries, so that whatever the model associates with "task" also applies to "tasking", "tasked", "taskmaster", etc. So a model making up compounds that don't exist is entirely possible and even desirable, especially since real humans do it with English all the time.
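
To make that concrete, here is a rough sketch of sub-word splitting. The vocabulary and the greedy longest-match rule are assumptions picked for illustration; real tokenizers learn their vocabulary from data (BPE and similar schemes) and don't necessarily split this way.

```python
# Toy vocabulary (hypothetical; real vocabularies are learned from a corpus).
VOCAB = {"task", "ing", "ed", "master"}

def tokenize(word, vocab=VOCAB):
    """Greedy longest-match split; unknown stretches fall back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary piece matches here: emit one character and move on.
            tokens.append(word[i])
            i += 1
    return tokens

for w in ["tasking", "tasked", "taskmaster", "mastertask"]:
    print(w, tokenize(w))
```

"mastertask" isn't a word, but it splits into the same known pieces as the real ones, so nothing at the token level marks it as invalid.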

staticautomatic 522 days ago

They’re called “lemmas”

Arnavion 522 days ago

The intent is the same, but as I understand it LLMs don't tokenize based on lemmas, though some of the tokens probably line up with them.
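
A small sketch of the difference, with a made-up lemma table and made-up sub-word pieces (neither comes from a real lemmatizer or tokenizer): a lemma is a dictionary form, while a token is just a piece of the surface spelling.

```python
# Hypothetical lemma table; a real lemmatizer relies on a full morphological lexicon.
LEMMAS = {"tasked": "task", "tasking": "task", "went": "go", "mice": "mouse"}

# Hypothetical sub-word pieces; real ones come from corpus statistics.
PIECES = ["task", "ing", "ed"]

def lemma(word):
    """Dictionary form of a word, or the word itself if it is not in the table."""
    return LEMMAS.get(word, word)

def shares_piece(word, other):
    """True if both words start with the same known sub-word piece."""
    return any(word.startswith(p) and other.startswith(p) for p in PIECES)

# "tasked" and "tasking" line up both ways: same lemma, same leading piece.
print(lemma("tasked") == lemma("tasking"), shares_piece("tasked", "tasking"))

# "went" has the lemma "go", but the spellings share no piece at all.
print(lemma("went"), shares_piece("went", "go"))
```

Regular inflections are where the two views coincide; irregular forms like "went" are where they come apart.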
