by rpedela
3156 days ago
For English NLP, I often stem first because it usually reduces noise. I think your main concern is that the abundance of homonyms will increase noise, which is certainly possible. Because I don't know Hebrew, I don't have any intuition about what may work. My advice is to experiment: cluster some Hebrew text without lemmatization, cluster it again with lemmatization using the Hebrew analyzer I linked, and compare the results. A literature review may also turn up experiments done with Hebrew and word embeddings/vectors. Sorry I cannot be of more help. EDIT: I found this paper, which may answer your question about lemmatization and word vectors. http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...
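To make the "cluster with and without" comparison concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_stem` is a hypothetical English suffix stripper standing in for a real Hebrew lemmatizer, the four sample sentences are made up, and the greedy cosine-threshold clustering is just the simplest thing that runs, not a recommendation.

```python
import math
from collections import Counter


def toy_stem(token):
    # Hypothetical stand-in for a real lemmatizer: strips a few English suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(text, normalize=None):
    # Bag-of-words counts, optionally after normalizing each token.
    tokens = text.lower().split()
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return Counter(tokens)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, normalize=None, threshold=0.5):
    # Greedy clustering: a text joins the first cluster whose first member
    # is at least `threshold` similar, otherwise it starts a new cluster.
    vectors = [vectorize(t, normalize) for t in texts]
    clusters = []  # each cluster is a list of indices into `texts`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up sample data, for illustration only.
docs = [
    "dogs barking loudly",
    "dog barked loudly",
    "stocks falling sharply",
    "stock fell sharply",
]

raw_clusters = cluster(docs)
stemmed_clusters = cluster(docs, normalize=toy_stem)
print("without normalization:", raw_clusters)
print("with normalization:   ", stemmed_clusters)
```

For the real experiment you would pass the Hebrew analyzer's lemmatizer as `normalize` and swap in whatever clustering you actually use. The point is only to keep everything else fixed so that any difference in the clusters comes from the normalization step.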
|
Thanks for your advice, anyway.