| HN Mirror


by kirillkh 3160 days ago

Some clarifying questions:

1) Do words in a generic corpus (such as Wikipedia) actually form well-separated clusters?

2) Is it correct that you find word clusters in the corpus as a preprocessing step (as opposed to at indexing or query time)?

3) Do I understand correctly that you use all words in clusters as synonyms and pass them to Solr at query/indexing time? Is it query time, index time or both?

4) Given a language where words have many syntactic forms (e.g. buy-bought-buying), how does it work with clusters? Do both syntactic forms and synonyms end up in the same cluster? Wouldn't it be beneficial to treat many of these different forms as the same word (i.e. perform stemming) and only list truly different, but closely related concepts as synonyms?


rpedela 3160 days ago

1. It should. The talk recommends using multiple cluster sizes (e.g. 50, 500, 5000) and giving more weight in the query to smaller clusters. Ideally you would run word2vec on your own domain-specific corpus and then cluster, but that only works if your corpus is large enough.

2. Correct. The goal of the pre-processing step is to generate a Solr synonyms file which can be added to your index mapping.

3a. You could use all the words, but in general I would advise against it. Using all the words from Wikipedia or Google News would be similar to using a thesaurus, which can add a lot of noise. For example, the word "cocoa" could mean chocolate, a city in Florida, or a programming framework. It is better to use a list of domain-specific keywords and phrases as a filter for which words are added to the Solr synonyms file. However, if your corpus is Wikipedia, Google News, or something equally generic, then using all the words makes sense.

3b. It must be both query and index time. For example, the phrase "java developer" would have the mapping "java developer => cluster_15" in the synonyms file. In order for the search terms "java developer" to match cluster_15, "cluster_15" must be indexed in place of "java developer".

4. The different forms will most likely end up in the same cluster, but stemming would guarantee it.
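
A minimal sketch of the pre-processing step from points 2, 3a, and 3b, assuming the clustering has already produced a term-to-cluster-id mapping. The function name `build_synonym_lines`, the sample clusters, and the keyword list are hypothetical; only the `java developer => cluster_15` line comes from the example above.

```python
def build_synonym_lines(term_to_cluster, domain_terms=None):
    """Return explicit mapping lines such as 'java developer => cluster_15'.

    term_to_cluster: dict mapping a word or phrase to an integer cluster id
                     (assumed output of clustering word vectors).
    domain_terms:    optional set of keywords/phrases used as a filter (3a);
                     None keeps every term, which suits a generic corpus.
    """
    lines = []
    for term in sorted(term_to_cluster):
        if domain_terms is not None and term not in domain_terms:
            continue  # skip terms outside the domain list to limit noise
        lines.append(f"{term} => cluster_{term_to_cluster[term]}")
    return lines


# Hypothetical clustering output, for illustration only.
clusters = {
    "java developer": 15,
    "java engineer": 15,
    "cocoa": 42,
    "chocolate": 42,
}
domain = {"java developer", "java engineer"}

synonym_lines = build_synonym_lines(clusters, domain)
print("\n".join(synonym_lines))
```

The resulting file would have to be applied by both the index-time and the query-time analyzer, so that documents and queries are both reduced to the same cluster token.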


kirillkh 3159 days ago

4) But recall that the language in question is such that stemming is hard. If you expand every form of every word in Hebrew, you obtain something like 600,000 words. And many of them have completely different meanings due to syntactic coincidences and short roots. So, ideally, the first step would a) determine what exactly each word in this document means in the given context, b) replace it with an unambiguous identifier.

For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding).


rpedela 3159 days ago

I don't know anything about Hebrew. But maybe lemmatization would be better than stemming since it takes meaning and context into account. It is also possible that it is an unnecessary step for clustered word vectors for your use case. If it were me, I would try without stemming/lemmatization first.

EDIT

I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec.

1. https://github.com/synhershko/HebMorph
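
A rough sketch of that pipeline, assuming some analyzer is available as a function that maps a token to its lemma. The `lemma_table` lookup below is a made-up English stand-in, not HebMorph's API. The result is a list of token lists, which is the sentence format that word2vec implementations such as gensim accept as training input.

```python
def lemmatize_sentences(sentences, lemmatize):
    """Split each sentence on whitespace and replace every token with its lemma."""
    return [[lemmatize(tok) for tok in s.lower().split()] for s in sentences]


# Hypothetical lookup standing in for a real morphological analyzer.
lemma_table = {"bought": "buy", "buying": "buy", "buys": "buy"}


def toy_lemmatize(token):
    # Unknown tokens pass through unchanged.
    return lemma_table.get(token, token)


corpus = ["She bought cocoa", "He is buying chocolate"]
training_input = lemmatize_sentences(corpus, toy_lemmatize)
print(training_input)
```

A context-free lookup like this cannot separate homographs such as BRHA; that part would have to come from the analyzer itself.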


kirillkh 3159 days ago

Please forgive me for attempting to milk as much as possible from this discussion - I just don't have many opportunities to get useful advice on this subject, and I've been mulling over it for a long time.

> But maybe lemmatization would be better than stemming

You're right, I'm using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.

> It is also possible that it is an unnecessary step for clustered word vectors for your use case

I don't focus on a specific use case, I'm just trying to find a way to enable full-text search for Hebrew. Searching based on concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But I'm just thinking about what a typical cluster would look like, and I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match based on all these forms will inevitably create a large number of false positives due to the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep reiterating my original question of whether this system can first be used for lemmatizing and then for everything else.


rpedela 3159 days ago

For English NLP, I often stem first because it usually reduces noise. I think your main concern is that the abundance of homonyms will increase noise, which is certainly possible. Because I don't know Hebrew, I don't have any intuition on what may work. My advice is to experiment. Cluster some Hebrew text without lemmatization, cluster with lemmatization using that Hebrew analyzer I linked, and see what the results are. Also, maybe a literature review will turn up experiments done with Hebrew and word embeddings/vectors. Sorry I cannot be of more help.
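
One way to make that comparison concrete, sketched under assumptions: for each cluster, count how many distinct base forms its members reduce to. If most clusters collapse to one or two lemmas, the clustering is mostly rediscovering inflection. The cluster contents and the `lemma_of` table are invented for illustration.

```python
def lemma_diversity(cluster_words, lemma_of):
    """Return (distinct lemmas) / (distinct surface forms) for one cluster.

    A value near 1.0 means the cluster holds genuinely different words;
    a value near 0 means it is mostly inflections of a few base forms.
    """
    words = set(cluster_words)
    if not words:
        return 0.0
    lemmas = {lemma_of.get(w, w) for w in words}
    return len(lemmas) / len(words)


# Hypothetical lemma table and clusters, for illustration only.
lemma_of = {"bought": "buy", "buying": "buy", "buys": "buy"}
clusters = {
    "cluster_1": ["buy", "bought", "buying", "buys"],
    "cluster_2": ["buy", "purchase", "acquire", "bought"],
}

scores = {name: lemma_diversity(ws, lemma_of) for name, ws in clusters.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

Running the same measurement on clusters built with and without lemmatization would give a number to compare, rather than eyeballing the cluster lists.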

EDIT

I found this paper which may answer your question about lemmatization and word vectors.

http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...


kirillkh 3159 days ago

Thanks, I know about HebMorph. Its authors don't want it to be used for commercial purposes (at least for free), so that limits its usability beyond simple experiments. As to your second link, it confirms my suspicions that lemmatizing is important for Hebrew, but the code they reference in the footnotes is equally hostile to commercial usage. I was really hoping word2vec or other new tools would enable building a lemmatizer from scratch without much hassle.

Thanks for your advice, anyway.
