| HN Mirror


by rpedela 3156 days ago

For English NLP, I often stem first because it usually reduces noise. I think your main concern is that the abundance of homonyms will increase noise, which is certainly possible. Because I don't know Hebrew, I don't have any intuition about what may work. My advice is to experiment: cluster some Hebrew text without lemmatization, cluster it again with lemmatization using the Hebrew analyzer I linked, and compare the results. A literature review may also turn up experiments done with Hebrew and word embeddings/vectors. Sorry I cannot be of more help.

EDIT

I found this paper which may answer your question about lemmatization and word vectors.

http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...


kirillkh 3156 days ago

Thanks, I know about HebMorph. Its authors don't want it to be used for commercial purposes (at least not for free), which limits its usefulness beyond simple experiments. As to your second link, it confirms my suspicion that lemmatizing is important for Hebrew, but the code they reference in the footnotes is equally hostile to commercial usage. I was really hoping word2vec or other new tools would make it possible to build a lemmatizer from scratch without much hassle.

Thanks for your advice, anyway.


rpedela 3156 days ago

I think using word vectors for lemmatization is an interesting idea and on the cutting edge. Here is a paper which discusses it. https://link.springer.com/chapter/10.1007/978-3-662-49192-8_...


kirillkh 3155 days ago

Thanks! That paper is extremely helpful. Still, one thing is missing to complete the picture for me right now. At the input, I have a list of words that I want to index or query. When indexing, they usually form a sentence; when querying, they might just be keywords. But in both cases the words will usually be chosen by the user/author in such a way that a human who reads all the input words together is able to disambiguate the meaning of every word. How to do that disambiguation automatically is precisely what I'm missing.

Let's say the user entered three words: A B C. You look up each of them among the vectors and discover that there are three matching vectors for A, four for B and five for C (and for the sake of generality let's assume that there are more words than just 3 in the input, so it's impractical to test every subset of these words for co-occurrence). How do you jointly select the correct vector for each of the words?


kirillkh 3155 days ago

Actually, I have an idea, albeit not without some doubts.

Let x1 be the number of vectors matching A, x2 the number of vectors matching B, and so on up to xn. Let c1..cn be a particular selection of vectors, one per input word. My main assumption here is that, in order to determine which of these vectors are most often encountered together in the same context [1], our goal is to find the j that maximizes sum_{i from 1 to n, i!=j}[d_i], where d_i=(c_j dot c_i) if the dot product is nonnegative, otherwise d_i=0. I'm not sure this is true, primarily because I don't know whether by summing up these dot products we add apples to apples or apples to oranges.

Then, in order to find the best selection of vectors c1..cn, we can iterate over every vector v_k matching A and dot v_k with every vector matching B, then pick the maximum m2 (or 0 if it's negative); dot v_k with every vector matching C, then pick the maximum m3; etc. Thus, for the k'th iteration we obtain the selection of vectors that maximizes M_1k=sum_i[m_i]. After we're done with all x1 iterations over the candidates for c1, we pick the best such selection M1=max_k[M_1k]. This is all done in O(x1(x2+x3+...+xn)) time.

Next, we repeat the above process for all x2 vectors matching B and obtain M2, etc, etc. Ultimately, we pick the selection of vectors that produced the highest M_t across all choices of t. Overall, we get O((sum_i[xi])^2), which seems fast enough. What do you think?

[1] One obvious problem is this limits the number of contexts we match against to just one.
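A minimal sketch of the selection procedure above, in plain Python. It assumes every input word already comes with a non-empty list of candidate vectors (one per possible meaning), which is itself an assumption about where the vectors come from. The names `dot` and `select_senses` are made up for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_senses(candidates: List[List[Vector]]) -> Tuple[List[int], float]:
    """Pick one candidate vector per word.

    candidates[i] holds the candidate vectors for the i-th input word.
    Assumption: every inner list is non-empty.
    Returns (chosen index per word, score of that selection).
    """
    best_score = float("-inf")
    best_choice: List[int] = []
    # Try every word t as the "anchor" word ...
    for t, anchor_candidates in enumerate(candidates):
        # ... and every candidate vector v of that anchor word.
        for k, v in enumerate(anchor_candidates):
            choice = [0] * len(candidates)
            choice[t] = k
            score = 0.0
            for i, others in enumerate(candidates):
                if i == t:
                    continue
                sims = [dot(v, u) for u in others]
                # Best-matching candidate of word i for this anchor.
                j = max(range(len(sims)), key=sims.__getitem__)
                choice[i] = j
                # Negative dot products contribute 0, as in the description.
                score += max(sims[j], 0.0)
            if score > best_score:
                best_score, best_choice = score, choice
    return best_choice, best_score


if __name__ == "__main__":
    # Toy 2-D vectors, invented purely to exercise the function.
    words = [
        [[1.0, 0.0], [0.0, 1.0]],    # candidates for A
        [[0.9, 0.1], [-1.0, 0.0]],   # candidates for B
        [[0.0, 1.0], [1.0, 0.2]],    # candidates for C
    ]
    print(select_senses(words))
```

Note that the vectors chosen for the non-anchor words are only compared with the anchor, never with each other, so this is a heuristic and not an exhaustive search over all combinations.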


rpedela 3154 days ago

Each token would have one vector from word2vec. A token could be a word or a phrase, depending on the pre-processing. The words in a phrase are usually concatenated with an underscore. I recommend gensim if you want/need phrases.
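To illustrate the underscore convention only: a small sketch that merges known two-word phrases into single tokens before training. This is not how gensim detects phrases; the function name `merge_phrases` and the hand-supplied phrase set are assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def merge_phrases(tokens: List[str], phrases: Set[Tuple[str, str]]) -> List[str]:
    """Join adjacent token pairs listed in `phrases` with an underscore.

    Assumption: `phrases` is supplied by hand here; a real pipeline would
    learn it from co-occurrence statistics.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # Two words become one token, e.g. "new_york".
            merged.append(pair[0] + "_" + pair[1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    known = {("new", "york")}
    print(merge_phrases(["i", "love", "new", "york"], known))
```

Each merged token then gets its own single vector, just like any other word.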


kirillkh 3154 days ago

Ah, you're right, word2vec assigns one vector to each word, as opposed to one vector to each meaning. Then the problem remains: we can't differentiate between homonyms.

But it seems it's been solved, too: https://github.com/sbos/AdaGram.jl


rpedela 3154 days ago

There is also sense2vec, which I think tries to do something similar. https://explosion.ai/blog/sense2vec-with-spacy


kirillkh 3155 days ago

Also, this makes me wonder what other things you can do with vectors. If you compute the dot product between the vector of a verb or a noun and the vector "singular"-"plural", will it give a positive value for plurals and a negative one for singulars (or vice versa)?
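The arithmetic in question could be sketched like this. The 3-D vectors below are invented so that the first coordinate encodes grammatical number by construction; they only show the mechanics and say nothing about whether real word2vec vectors behave this way. The names `dot`, `subtract` and `number_score` are hypothetical.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subtract(a: List[float], b: List[float]) -> List[float]:
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def number_score(word: str, vectors: Dict[str, List[float]]) -> float:
    """Dot the word's vector with the direction "singular" - "plural"."""
    direction = subtract(vectors["singular"], vectors["plural"])
    return dot(vectors[word], direction)


# Toy vectors, made up for illustration (not taken from any trained model).
toy_vectors = {
    "singular": [1.0, 0.2, 0.0],
    "plural": [-1.0, 0.2, 0.0],
    "cat": [0.8, 0.5, 0.3],
    "cats": [-0.7, 0.5, 0.3],
}

if __name__ == "__main__":
    for w in ("cat", "cats"):
        print(w, number_score(w, toy_vectors))
```

With these toy values the sign of the score separates "cat" from "cats"; on trained embeddings that would have to be checked empirically.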


rpedela 3154 days ago

No idea. Experiment!
