|
One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), assign each word a unique index between 0 and 49,999, and then build an embedding for a text by starting from a 50k-long vector of zeros and adding 1 at a word's index each time that word occurs in the text. Then normalize the vector so it sums to one. Presto -- embeddings! You can use cosine similarity with them and all that good stuff, and the results aren't totally terrible. The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur so frequently that they don't signal similarity, handling synonyms or words that are related to one another, etc.). But stripping out the deep learning bits really does make it easier to understand. |
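To make the counting recipe concrete, here is a minimal sketch in plain Python. The nine-word vocabulary, the sample sentences, and the helper names `embed` and `cosine` are all made up for illustration; a real setup would use something closer to the 50k words mentioned above.

```python
import math

# Hypothetical tiny vocabulary standing in for the ~50k-word one.
VOCAB = ["cat", "dog", "sat", "mat", "ran", "park", "the", "on", "in"]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def embed(text):
    """Count each vocabulary word in text, then scale so the entries sum to 1."""
    vec = [0.0] * len(VOCAB)
    for word in text.lower().split():
        if word in INDEX:  # words outside the vocabulary are simply ignored
            vec[INDEX[word]] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


a = embed("the cat sat on the mat")
b = embed("the cat sat in the park")
c = embed("dog ran")

print(cosine(a, b))  # texts sharing several words score high
print(cosine(a, c))  # texts sharing no words score zero
```

Note that "the" does a lot of the work in the first comparison, which is exactly why the refinements mentioned above (filtering out very frequent words, for one) matter.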
The classic example is word embeddings such as word2vec or GloVe: because the embeddings are meaningful in this way, you can see vector relationships such as "man - woman" ≈ "king - queen".
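The arithmetic behind that relationship can be sketched with a toy example. The 2-D vectors below are hand-picked so the analogy works out exactly (one axis loosely meaning "royalty", the other "gender"); they are not real word2vec or GloVe values, where the relationship only holds approximately. The names `TOY`, `sub`, `add`, and `nearest` are hypothetical helpers.

```python
import math

# Hand-picked toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These are invented for illustration, NOT taken from a trained model.
TOY = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vec, exclude):
    """Return the word whose vector has the highest cosine similarity to vec."""
    candidates = [w for w in TOY if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, TOY[w]))


# "man - woman" and "king - queen" are the same offset in this toy space.
print(sub(TOY["man"], TOY["woman"]), sub(TOY["king"], TOY["queen"]))

# Equivalent framing: king - man + woman lands nearest to queen.
target = add(sub(TOY["king"], TOY["man"]), TOY["woman"])
best = nearest(target, exclude={"king", "man", "woman"})
print(best)
```

With trained embeddings you would do the same nearest-neighbor lookup over the whole vocabulary, and the match would be close rather than exact.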