by thesz 932 days ago
There are character embeddings that allow one to recover a word's embedding just by summing the embeddings of the individual bytes/chars in the word: https://github.com/sonlamho/Char2Vec

The token vocabularies of LMs reserve individual characters so that scrambled or new words can be encoded. And most LMs see scrambled words as part of the training corpus; thus, they learn character-level embeddings.
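To illustrate the character fallback, here is a minimal sketch. It assumes a toy greedy longest-match tokenizer and a made-up vocabulary; real LMs typically use BPE-style merges, so `tokenize` and `vocab` are hypothetical stand-ins, not any model's actual tokenizer.

```python
def tokenize(word, vocab):
    # Greedy longest-match tokenizer (a simplification, not real BPE).
    # Single characters always match, so any string can be encoded.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabulary: a few multi-character pieces plus every lowercase letter.
vocab = {"de", "spair", "pra", "ised"} | set("abcdefghijklmnopqrstuvwxyz")

print(tokenize("despair", vocab))  # known word: multi-character pieces
print(tokenize("dsepair", vocab))  # scrambled word: falls back to characters
```

The scrambled word still gets encoded, only as single-character tokens, which is how character-level tokens end up with trained embeddings.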

Thus, basically, the paper is very old news. This behavior is expected.

2 comments

You’re only being downvoted because the average NLP knowledge here is low, but you are 100% correct that this paper is very old news.
Thanks.

I haven't noticed downvotes, though. I thought I was just ignored. ;)

I'm open to being corrected, but I feel that your statement is missing the point. An embedding can trivially have character embeddings that sum to word embeddings, or it can have word embeddings that represent semantic concepts well, but it's not at all trivial to satisfy both constraints simultaneously, as you make it out to be. Adding a letter to a word won't consistently shift the vector in a direction that also captures the semantic meaning of that shift.

Or to give a more concrete example: "despair", "aspired", "diapers", and "praised" are all anagrams. If summing the embeddings of characters produces word embeddings, then the embeddings of all 4 of those words must be identical. That significantly constrains semantic differentiation between those 4 very different words.
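The anagram argument can be checked with a small sketch. It assumes random 8-dimensional character vectors (`char_emb` and `word_vec` are hypothetical names, not taken from Char2Vec or any LM); the point is only that a plain sum ignores character order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical random embedding per lowercase letter, for illustration only.
char_emb = {c: rng.normal(size=8) for c in "abcdefghijklmnopqrstuvwxyz"}

def word_vec(word):
    # Bag of characters: the word vector is the plain sum of its character vectors.
    return np.sum([char_emb[c] for c in word], axis=0)

words = ["despair", "aspired", "diapers", "praised"]
vecs = [word_vec(w) for w in words]

# Addition is commutative, so every anagram gets the same vector
# (up to floating-point rounding, hence allclose).
all_equal = all(np.allclose(vecs[0], v) for v in vecs[1:])
print(all_equal)
```

Any purely additive scheme has this property, whatever the character vectors are.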

What's going on is more complex than what you've stated - and put simply, if reserving single-character embeddings was all that was needed to produce this result, then all the LLMs would produce these results successfully. They don't - and that demonstrates that those two models are more "powerful"/"adept" than the others.

Transformers use positional embeddings. The embedding of "a" in the first position of a word will be (slightly) different from the embedding of "a" in the second position, roughly. The model's input at each position is likewise a sum: the actual embedding of a token (which can be a character) plus the encoding of its position.
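A minimal sketch of that sum, assuming random vectors for both tables (`char_emb`, `pos_emb`, and `model_inputs` are hypothetical; real models learn these tables or use fixed sinusoidal encodings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, max_len = 8, 16
# Hypothetical tables: one vector per letter, one vector per position.
char_emb = {c: rng.normal(size=dim) for c in "abcdefghijklmnopqrstuvwxyz"}
pos_emb = rng.normal(size=(max_len, dim))

def model_inputs(word):
    # One row per character: character embedding plus the embedding of its position.
    return np.stack([char_emb[c] + pos_emb[i] for i, c in enumerate(word)])

a = model_inputs("despair")
b = model_inputs("praised")

# The per-position rows differ between the two anagrams...
inputs_differ = not np.allclose(a, b)
# ...although a plain sum over positions would still coincide.
sums_match = np.allclose(a.sum(axis=0), b.sum(axis=0))
print(inputs_differ, sums_match)
```

So the anagrams are distinguishable only because the model sees the sequence of rows rather than their sum.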

The words you presented as examples are used in different contexts. You will hardly find something like "pooped despair" or "deep abyss of praised." The context will guide the LM down different paths even when the embeddings are the same; neural LMs will learn that for sure.

(In fact, I used a sorted context prefix in one of the LMs I researched (order-4 or longer features, to save memory used by SNMLM), and I saw little to no difference in perplexity.)

Also, the difference between LMs is the training corpus, among other things. We do not know how these things are trained; the corpora are not generally accessible. Oftentimes we do not even know the token vocabulary: how many tokens there are, how long they are, etc.

What you ascribe to powerfulness can be a difference in training and data preprocessing.