by BoiledCabbage
932 days ago
I'm open to being corrected, but I feel that your statement is missing the point. An embedding can trivially have char embeddings that sum to word embeddings, or it can have word embeddings that represent semantic concepts well, but it's not at all trivial to satisfy both constraints simultaneously, as you make it out to be.
Under that constraint, adding a letter to a word won't consistently shift it in one direction that also captures the semantic meaning of that vector shift. Or to give a more concrete example: "despair", "aspired", "diapers", and "praised" are all anagrams. If summing the embeddings of characters produces words, then the embeddings of all 4 of those words must be identical. That significantly constrains semantic differentiation between those 4 very different words. What's going on is more complex than what you've stated - and put simply, if reserving single-character embeddings was all that was needed to produce this result, then all the LLMs would produce these results successfully. They don't - and that demonstrates that those two models are more "powerful"/"adept" than the others.
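To make the anagram point concrete, here is a minimal sketch. It assumes made-up random integer vectors for each letter; `char_emb`, `word_embedding`, and `DIM` are hypothetical names for illustration and are not taken from any real model. Integers are used so the sums are exact regardless of letter order.

```python
import random

random.seed(0)
DIM = 8  # hypothetical embedding size, chosen only for illustration

# Hypothetical character embeddings: one random integer vector per letter.
char_emb = {
    c: [random.randint(-9, 9) for _ in range(DIM)]
    for c in "abcdefghijklmnopqrstuvwxyz"
}

def word_embedding(word):
    """Sum the character vectors of a word; letter order plays no role."""
    vec = [0] * DIM
    for c in word:
        vec = [v + e for v, e in zip(vec, char_emb[c])]
    return vec

words = ["despair", "aspired", "diapers", "praised"]
vectors = [word_embedding(w) for w in words]

# Addition is commutative, so every anagram collapses to the same vector.
all_equal = all(v == vectors[0] for v in vectors)
print(all_equal)
```

Whatever values the letter vectors take, a pure sum cannot tell these four words apart, so any semantic distinction between them would have to come from somewhere other than the summed embedding.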
|
The words you presented as examples are used in different contexts. You will hardly find something like "pooped despair" or "deep abyss of praised." The context will guide an LM down different paths even when the embeddings are the same; neural LMs will learn that for sure.
(In fact, in one of the LMs I researched, I used a sorted context prefix for features of order 4 or longer, to save memory used by SNMLM, and I saw little to no difference in perplexity.)
Also, the difference between LMs is the training corpus, among other things. We do not know how these things are trained, and the corpora are generally not accessible. Oftentimes we do not even know the token vocabulary: how many tokens there are, how long they are, etc.
What you ascribe to power may instead be a difference in training and data preprocessing.