Hacker News new | ask | show | jobs
by Arnavion 529 days ago
IIUC the input to LLMs is tokenized not on word boundaries but some kind of inter-syllable boundaries, because then whatever the model associated with "task" will also apply to "tasking", "tasked", "taskmaster", etc for example. So a model making up compounds that don't exist would be fully possible and even desirable, especially since real humans do it with English all the time.
1 comments

They’re called “lemma”
The intent is the same, but as I understand it LLMs don't tokenize based on lemmas, though some of the tokens probably line up with them.