Hacker News new | ask | show | jobs
by gliptic 1147 days ago
> To be semantically driven, it would be reasonable to expect that synonyms would have similar representations.

How could a tokenizer do anything about that unless the synonyms actually share substrings? The vector embedding is learned, not part of the tokenizer.

1 comments

It couldn't, which is why it's a good idea to avoid the word, "semantic".

The same problem also exists in the name, "Large Language Model". Sure, the content being modeled contains language, but the model itself is not specific or limited to language patterns. We ought to call them "Large Text Models"; or better yet, "Text Inference Models".

The words we use to describe software are very important: they inform goals and expectations. They define the context that software exists in.

I see our biggest mistake as calling these tools, "Artificial Intelligence". That title began as a goal and a category of work: it doesn't belong in the title or description of software unless that software has actually met the goal.