
Transformers use positional embeddings. The embedding of "a" in the first position of a word will be (slightly) different from the embedding of "a" in the second position, roughly. What the model actually sees is the sum of the embedding of the token itself (which can be a character) and an encoding of its position.
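To make that concrete, here is a minimal sketch of the "token embedding plus position encoding" sum. It assumes the sinusoidal encoding from the original Transformer paper; many models use learned or rotary position schemes instead. The tiny character vocabulary, the dimension and the random (untrained) embedding values are all made up for illustration.

```python
import math
import random

DIM = 8  # illustrative embedding size, not taken from any real model


def sinusoidal_position(pos, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


# Hypothetical character-level vocabulary with random, untrained embeddings.
rng = random.Random(0)
token_embedding = {ch: [rng.gauss(0, 1) for _ in range(DIM)] for ch in "abc"}


def embed(chars):
    """Return one input vector per character: token embedding + position encoding."""
    return [
        [t + p for t, p in zip(token_embedding[ch], sinusoidal_position(pos, DIM))]
        for pos, ch in enumerate(chars)
    ]


vectors = embed("aa")
# Same character, different positions, so the input vectors differ.
print(vectors[0] != vectors[1])
```

The two vectors share the same token part and differ only by the position part, which is what lets the model tell the first "a" from the second.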

The words you presented as examples are used in different contexts. You will hardly find something like "pooped despair" or "deep abyss of praised." The context will guide the LM down different paths even when the embeddings are the same; neural LMs will learn that for sure.

(In fact, I used a sorted context prefix in one of the LMs I researched (for order-4 or longer features, to save the memory used by SNMLM), and I saw little to no difference in perplexity.)

Also, the difference between LMs is the training corpus, among other things. We do not know how these things are trained, and the corpora are generally not accessible. Oftentimes we do not even know the token vocabulary: how many tokens there are, how long they are, etc.

What you ascribe to the model being more powerful can be a difference in training and data preprocessing.