by kirillkh
3156 days ago
Please forgive me for trying to milk as much as possible from this discussion. I just don't have many opportunities to get useful advice on this subject, and I've been mulling it over for a long time.

> But maybe lemmatization would be better than stemming

You're right, I've been using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.

> It is also possible that it is an unnecessary step for clustered word vectors for your use case

I'm not focused on a specific use case; I'm just trying to find a way to enable full-text search for Hebrew. Searching by concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But when I think about what a typical cluster would look like, I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match on all of these forms will inevitably produce a large number of false positives because of the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep returning to my original question: can this system be used for lemmatization first, and only then for everything else?
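To make the false-positive worry concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the two documents, the `FORMS` table, and the per-token lemmas, which I'm assuming come from some hypothetical lemmatizer that disambiguates in context. I'm using English ("saw" the tool vs. "saw" as the past tense of "see") instead of Hebrew to keep it readable, and it has nothing to do with Lucene's actual API.

```python
# Toy corpus (invented). Each token is (surface form, lemma chosen in context).
# The lemmas are assumed to come from a hypothetical context-aware lemmatizer.
DOCS = {
    1: [("she", "she"), ("saw", "see"), ("the", "the"), ("bird", "bird")],
    2: [("the", "the"), ("saw", "saw"), ("is", "be"), ("sharp", "sharp")],
}

# Hypothetical table of surface forms per base form, standing in for a cluster.
FORMS = {"see": {"see", "sees", "saw", "seen"}}


def search_by_expansion(lemma):
    """Match any document containing any surface form of the lemma."""
    forms = FORMS[lemma]
    return sorted(
        doc_id
        for doc_id, tokens in DOCS.items()
        if any(surface in forms for surface, _ in tokens)
    )


def search_by_lemma(lemma):
    """Match only documents whose tokens were lemmatized to this lemma."""
    return sorted(
        doc_id
        for doc_id, tokens in DOCS.items()
        if any(token_lemma == lemma for _, token_lemma in tokens)
    )


print(search_by_expansion("see"))  # [1, 2]: document 2 is a false positive
print(search_by_lemma("see"))      # [1]
```

Expanding the query to every surface form pulls in document 2 because of the homonym, while matching on lemmas assigned at index time does not. That is the gap I suspect clustering alone won't close.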
EDIT
I found this paper, which may answer your question about lemmatization and word vectors:
http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...