Hacker News new | ask | show | jobs
by kirillkh 3160 days ago
4) But recall that the language in question is such that stemming is hard. If you expand every form of every word in Hebrew, you obtain something like 600,000 words. And many of them have completely different meanings due to syntactic coincidences and short roots. So, ideally, the first step would a) determine what exactly each word in this document means in the given context, b) replace it with an unambiguous identifier.

For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding).

1 comments

I don't know anything about Hebrew. But maybe lemmatization would be better than stemming since it takes meaning and context into account. It is also possible that it is an unnecessary step for clustered word vectors for your use case. If it was me, I would try without stemming/lemmatization first.

EDIT

I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec.

1. https://github.com/synhershko/HebMorph

Please forgive me attempting to milk as much as possible from this discussion - I just don't have many opportunities to get useful advice on this subject, and I've been mulling over it for a long time.

> But maybe lemmatization would be better than stemming

You're right, I'm using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.

> It is also possible that it is an unnecessary step for clustered word vectors for your use case

I don't focus on a specific use case, I'm just trying to find a way to enable full-text search for Hebrew. Searching based on concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But I'm just thinking what a typical cluster would look like, and I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match based on all these forms will inevitably create a large number of false positives due to the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep reiterating my original question of whether this system can first be used for lemmatizing and then everything else.

For English NLP, I often stem first because it usually reduces noise. I think your main concern is that the abundance of homonyms will increase noise which is certainly possible. Because I don't know Hebrew, I don't have any intuition on what may work. My advice is to experiment. Cluster some Hebrew text without lemmatization, cluster with lemmatization using that Hebrew analyzer I linked, and see what the results are. Also maybe a literature review will yield experiments done with Hebrew and word embeddings/vectors. Sorry I cannot be of more help.

EDIT

I found this paper which may answer your question about lemmatization and word vectors.

http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...

Thanks, I know about HebMorph. Its authors don't want it to be used for commercial purposes (at least for free), so that limits its usability beyond simple experiments. As to your second link, it confirms my suspicions that lemmatizing is important for Hebrew, but the code they reference in the footnotes is equally hostile to commercial usage. I was really hoping word2vec or other new tools would enable building lemmatizer from scratch without much hassle.

Thanks for your advice, anyway.

I think using word vectors for lemmatization is an interesting idea and on the cutting edge. Here is a paper which discusses it. https://link.springer.com/chapter/10.1007/978-3-662-49192-8_...
Thanks! That paper is extremely helpful. Still, there is one thing missing to complete the picture for me right now. At the input, I have a list of words that I want to index or query. When indexing, they usually form a sentence, when querying, they might just be keywords. But in both cases the words will usually be selected by the user/author in such a way that a human that reads all the words from the input together is able to disambiguate the meaning of every word. This is precisely what I'm missing.

Let's say the user entered three words: A B C. You look up each of them among the vectors and discover that there are three matching vectors for A, four for B and five for C (and for the sake of generality let's assume that there are more words than just 3 in the input, so it's impractical to test every subset of these words for co-occurrence). How do you jointly select the correct vector for each of the words?

Also, this makes me wonder what other things you can do with vectors. If you compute dot product between a verb or a noun with vector "singular"-"plural", will it give a positive value for plurals and a negative for singulars (or vice versa)?