Hacker News
by rpedela 3156 days ago
I think using word vectors for lemmatization is an interesting idea and on the cutting edge. Here is a paper which discusses it. https://link.springer.com/chapter/10.1007/978-3-662-49192-8_...
2 comments

Thanks! That paper is extremely helpful. Still, one thing is missing to complete the picture for me. My input is a list of words that I want to index or query. When indexing, they usually form a sentence; when querying, they might just be keywords. In both cases, though, the user/author will usually choose the words so that a human who reads all of them together can disambiguate the meaning of each one. This is precisely what I'm missing.

Let's say the user entered three words: A B C. You look up each of them among the vectors and discover that there are three matching vectors for A, four for B and five for C (and for the sake of generality let's assume that there are more words than just 3 in the input, so it's impractical to test every subset of these words for co-occurrence). How do you jointly select the correct vector for each of the words?

Actually, I have an idea, albeit not without some doubts.

Let x1 be the number of vectors matching A, x2 the number of vectors matching B, and so on up to xn. Let c1..cn be a particular selection of vectors, one per word. My main assumption is that, to determine which of these vectors are most often encountered together in the same context [1], our goal is to find the j that maximizes sum_{i from 1 to n, i!=j}[d_i], where d_i=(c_j dot c_i) if the dot product is nonnegative and d_i=0 otherwise. I'm not sure this is true, primarily because I don't know whether summing these dot products adds apples to apples or apples to oranges.

Then, to find the best selection of vectors c1..cn, we can iterate over every vector v_k matching A: dot v_k with every vector matching B and pick the maximum m2 (or 0 if it's negative); dot v_k with every vector matching C and pick the maximum m3; etc. Thus, for the k'th iteration we obtain the selection of vectors that maximizes M_1k=sum_i[m_i]. After we're done with all x1 iterations (one per candidate for c1), we pick the best such selection, M1=max_k[M_1k]. This is all done in O(x1(x2+x3+...+xn)) time.

Next, we repeat the above process for all x2 vectors matching B and obtain M2, etc, etc. Ultimately, we pick the selection of vectors that produced the highest M_t across all choices of t. Overall, we get O((sum_i[xi])^2), which seems fast enough. What do you think?

[1] One obvious problem is this limits the number of contexts we match against to just one.
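For concreteness, here is a minimal Python sketch of the procedure described above, under stated assumptions. The function name `select_senses`, the helper `dot`, and the toy 2-d vectors are all made up for illustration; real sense vectors would come from a trained model. It anchors on each candidate vector in turn, greedily picks the best-matching candidate for every other word (clamping negative dot products to 0), and keeps the selection with the highest total.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_senses(candidates: List[List[Vector]]) -> Tuple[List[int], float]:
    """candidates[i] holds the candidate vectors for word i.

    Returns (choice, score), where choice[i] is the index of the vector
    picked for word i and score is the best sum of clamped dot products.
    """
    best_score = float("-inf")
    best_choice: List[int] = []
    for t, anchor_senses in enumerate(candidates):    # word used as anchor
        for k, v_k in enumerate(anchor_senses):       # candidate vector of the anchor
            choice = [0] * len(candidates)
            choice[t] = k
            score = 0.0
            for i, senses in enumerate(candidates):
                if i == t:
                    continue
                # m_i: best nonnegative similarity between v_k and word i's candidates
                sims = [max(0.0, dot(v_k, c)) for c in senses]
                j = max(range(len(sims)), key=sims.__getitem__)
                choice[i] = j
                score += sims[j]
            if score > best_score:
                best_score, best_choice = score, choice
    return best_choice, best_score


# Toy input: word A has two candidate vectors, B has one, C has two.
toy = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9]],
    [[0.9, 0.1], [0.0, 1.0]],
]
choice, score = select_senses(toy)
print(choice, score)
```

The two nested loops over anchors and the inner loop over all other candidates give the O((sum_i[xi])^2) dot products estimated above. Note that the vectors for the non-anchor words are only compared against the anchor, not against each other.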

Each token would have one vector from word2vec. A token could be a word or phrase depending on the pre-processing. The words in a phrase are usually concatenated with an underscore. I recommend gensim if you want/need phrases.
Ah, you're right, word2vec assigns one vector to each word, as opposed to one vector to each meaning. Then the problem remains: we can't differentiate between homonyms.

But it seems it's been solved, too: https://github.com/sbos/AdaGram.jl

There is also sense2vec which I think tries to do something similar. https://explosion.ai/blog/sense2vec-with-spacy
Also, this makes me wonder what other things you can do with vectors. If you compute the dot product of a verb's or noun's vector with the vector "singular"-"plural", will it give a positive value for plurals and a negative one for singulars (or vice versa)?
No idea. Experiment!
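A starting point for that experiment might look like the sketch below. The vectors are invented, and their first coordinate is deliberately built to encode number, so the signs come out clean by construction. It only shows the mechanics; whether real word2vec vectors behave this way is exactly what would need testing. The names `toy`, `direction` and `number_score` are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


# Made-up 3-d "embeddings"; a real test would load trained vectors instead.
toy = {
    "singular": [1.0, 0.2, 0.0],
    "plural": [-1.0, 0.2, 0.0],
    "cat": [0.8, 0.1, 0.9],
    "cats": [-0.7, 0.1, 0.9],
}

# The "singular" - "plural" direction from the question above.
direction = sub(toy["singular"], toy["plural"])


def number_score(word):
    """Projection of a word's vector onto the singular-minus-plural direction."""
    return dot(toy[word], direction)


for word in ("cat", "cats"):
    print(word, number_score(word))
```

With these toy values the singular form scores positive and the plural negative. On real embeddings it would probably be worth checking many word pairs rather than one, since a single pair says little.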