|
|
|
|
|
by kirillkh
3160 days ago
|
|
4) But recall that the language in question is such that stemming is hard. If you expand every form of every word in Hebrew, you obtain something like 600,000 words. And many of them have completely different meanings due to syntactic coincidences and short roots. So, ideally, the first step would a) determine what exactly each word in this document means in the given context, b) replace it with an unambiguous identifier. For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding). |
|
EDIT
I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec.
1. https://github.com/synhershko/HebMorph