by PaulHoule
2138 days ago
Word embeddings suck. Take a look at the graphs under "2. Linear substructures" in https://nlp.stanford.edu/projects/glove/ and note that they all involve looking at a small number of points. It is easy to reproduce plots like that, but if you try to increase the number of points the result breaks down completely. It is a curve fitting problem: for a small enough set of points compared to the number of dimensions, you can find a matrix that projects a set of random points to an exactly specified set of points in the plane. If you relax the problem to something like "put colors on the left side, put smells on the right side" you will get better than random performance from that kind of model, but not that much better than random. Word embeddings are a strategy that approaches an asymptote. Systems that are destined for low performance will perform better if you use a word embedding, but they throw away information up front that makes high performance impossible.
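To make the curve fitting point concrete, here is a rough sketch in Python with NumPy. It assumes nothing about how the GloVe plots were actually produced: the "embeddings" are pure noise, the 2-D targets are arbitrary, and `fit_residual` and the sizes used are made up for illustration. It only shows that a least-squares projection hits the targets exactly when there are fewer points than dimensions, and stops doing so when there are many more.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_residual(n_points, dim=50):
    """Fit a (dim x 2) linear map sending random vectors to arbitrary 2-D targets.

    Returns the relative error of the best least-squares fit.
    """
    X = rng.standard_normal((n_points, dim))  # stand-in "embeddings": pure noise
    Y = rng.standard_normal((n_points, 2))    # arbitrary target positions in the plane
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))

small = fit_residual(20)    # fewer points than dimensions: fit is essentially exact
large = fit_residual(2000)  # far more points than dimensions: fit is close to useless

print(f"20 points:   relative error {small:.2e}")
print(f"2000 points: relative error {large:.2f}")
```

With 20 points in 50 dimensions the error is at floating-point noise level, even though the inputs carry no structure at all. With 2000 points the same procedure barely beats predicting the origin.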
|
The linear relationship between things like king/queen etc is a cute demo but not really useful or used in practice.
The real usefulness of word embeddings is that similar concepts are close to each other so they make a great representation for other models (vs something like TF-IDF). These days they have been mostly surpassed in terms of state of the art by full language models, but the point is that simple techniques like average embedding of words in sentences generalised really well to unseen data.
And if you add in subword embeddings they generalise to unseen words, too.
We could talk about how context lets language models do this even better, but I'm still back trying to persuade the OP that this isn't just memorisation and good ML models work well on unseen data!
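For anyone following along, here is a minimal sketch of the two mechanics mentioned above: averaging word vectors into a sentence vector, and building a word vector from character n-grams so that unseen words still get a representation. This is a toy under loud assumptions: the n-gram vectors are untrained random vectors seeded from a CRC32 hash, and `char_ngrams`, `ngram_vector`, `word_vector`, `sentence_vector` and `DIM` are illustrative names, not any library's API. It says nothing about semantic quality, only that words sharing subword pieces land near each other.

```python
import zlib
import numpy as np

DIM = 64  # arbitrary embedding size for this toy

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def ngram_vector(gram):
    """Deterministic random vector per n-gram (untrained stand-in for a learned one)."""
    seed = zlib.crc32(gram.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(DIM)

def word_vector(word):
    """Average of the word's n-gram vectors; works for words never seen before."""
    return np.mean([ngram_vector(g) for g in char_ngrams(word)], axis=0)

def sentence_vector(sentence):
    """Average of the word vectors in a whitespace-tokenised sentence."""
    return np.mean([word_vector(w) for w in sentence.lower().split()], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "embeddings" shares most of its n-grams with "embedding"; "asymptote" shares none.
related = cosine(word_vector("embedding"), word_vector("embeddings"))
unrelated = cosine(word_vector("embedding"), word_vector("asymptote"))
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

In a trained model the n-gram vectors are learned rather than random, which is where the actual generalisation comes from; the averaging step is the same.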