|
|
|
|
|
by rpedela
3160 days ago
|
|
I don't know anything about Hebrew. But maybe lemmatization would be better than stemming since it takes meaning and context into account. It is also possible that it is an unnecessary step for clustered word vectors for your use case. If it was me, I would try without stemming/lemmatization first. EDIT I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec. 1. https://github.com/synhershko/HebMorph |
|
> But maybe lemmatization would be better than stemming
You're right, I'm using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.
> It is also possible that it is an unnecessary step for clustered word vectors for your use case
I don't focus on a specific use case, I'm just trying to find a way to enable full-text search for Hebrew. Searching based on concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But I'm just thinking what a typical cluster would look like, and I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match based on all these forms will inevitably create a large number of false positives due to the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep reiterating my original question of whether this system can first be used for lemmatizing and then everything else.