by habeanf 2876 days ago
Nice work! If I'm not mistaken, finding the root requires morphological disambiguation, and the correct analysis may change depending on the context/phrase in which the word is observed.

This is an active area of research in Morphologically Rich Languages (MRLs), since the same problem appears in other Semitic languages like Hebrew, as well as in non-Semitic MRLs like Turkish. There's a nice body of work to learn from, both with and without neural nets. For example, this paper from 2017 (http://aclweb.org/anthology/D17-1073) uses a neural model for morphological disambiguation. You can see a nice comparison of tools in the lemmatization results of the 2018 Universal Dependencies Shared Task: http://universaldependencies.org/conll18/results-lemmas.html (look for ar_padt).

If you're looking for training data, the Arabic treebanks in http://universaldependencies.org could help. I think some of them contain surface tokens with lemmas. I'm quite sure they also have roots.
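In case it helps, here is a minimal sketch of pulling (surface form, lemma) pairs out of a treebank file. It assumes the standard CoNLL-U layout that UD treebanks use: ten tab-separated columns per token, with ID, FORM and LEMMA as the first three. The function name and the sample sentence are made up for illustration; the sample is English, not taken from any Arabic treebank. Whether and where a given treebank stores roots is something you'd have to check per treebank, so this only covers lemmas.

```python
from typing import Iterable, Iterator, Tuple


def read_form_lemma_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (form, lemma) pairs from CoNLL-U formatted lines.

    Hypothetical helper, not part of any UD tooling.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # IDs like "1-2" are multiword token ranges and IDs like "1.1" are
        # empty nodes; only plain integer IDs are syntactic words.
        if "-" in token_id or "." in token_id:
            continue
        # "_" marks an unspecified lemma (unless the form itself is "_").
        if lemma == "_" and form != "_":
            continue
        yield form, lemma


# Made-up sample in CoNLL-U layout, for illustration only.
SAMPLE = (
    "# text = The dogs barked\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tdogs\tdog\tNOUN\t_\tNumber=Plur\t3\tnsubj\t_\t_\n"
    "3\tbarked\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for form, lemma in read_form_lemma_pairs(SAMPLE.splitlines()):
        print(form, lemma)
```

For a real file you would pass an open file handle instead of `SAMPLE.splitlines()`, e.g. `open(path, encoding="utf-8")`, where `path` points at whichever `.conllu` file you downloaded.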

Also, you might want to take a look at the CoNLL-SIGMORPHON shared tasks (2017: https://sites.google.com/view/conll-sigmorphon2017/ and 2018: https://sigmorphon.github.io/sharedtasks/2018/) on morphological reinflection, which IIRC is a similar task: taking an inflected form and reinflecting it with other morphological properties. They also have a nice data set to train on.

2 comments

The proceedings of a recent major conference would be a good place to start to see what has been done and what resources are available.

There should be conferences specializing in Arabic or in morphology, but the general-purpose LREC covers a wide range of topics, e.g. http://lrec2018.lrec-conf.org/en/conference-programme/accept... has papers that seem relevant, like "Build Fast and Accurate Lemmatization for Arabic" (http://www.lrec-conf.org/proceedings/lrec2018/pdf/1079.pdf), "Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM" (http://www.lrec-conf.org/proceedings/lrec2018/pdf/483.pdf), or "A Morphologically Annotated Corpus of Emirati Arabic" (http://www.lrec-conf.org/proceedings/lrec2018/pdf/529.pdf).

Wow, thanks! A wealth of resources here.