by ilimilku
2881 days ago
This is exciting to see. I am a Semitic philologist (Ph.D.) now breaking into the IT industry, and this sort of work is on my radar, though mostly with Hebrew and Aramaic. Arabic, being a Semitic language, has a non-linear morphology, which means that finding the root requires extracting the non-inflectional consonants from every possible position in a word. If you train a NN on full conjugation paradigms across a data set, it should begin to recognize what the various inflectional morphemes are. In other words, instead of looking for the root, look for everything that is not the root, and the root is what is left over. For example, the NN should be able to recognize that mu-, ya-, ta-, 'āC-, -ā-, -Ct-, -unna, etc. are all inflectional morphemes. It should also begin to recognize the various matres lectionis, or letters indicating long vowels, such as alif, waw, and ha. (I'm including vowels in my analysis because I think like a philologist, not like a typical reader of Arabic. Using unvowelled text might be more difficult for the NN.) Anyway, these are just some off-the-cuff thoughts. I look forward to digging deeper into your code and methodology sometime soon.
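To make the "strip everything that is not the root" idea concrete, here is a rough sketch. It is a hand-written, rule-based stand-in for what the NN would have to learn, not the project's method. The affix lists, the ASCII-style transliteration, and the function name `residual_root` are all my own assumptions for illustration, and the lists are far from complete (no infixes like -t-, no weak roots, no gemination).

```python
# Hypothetical, incomplete affix inventory in a simplified transliteration.
PREFIXES = ("mu", "ya", "ta")
SUFFIXES = ("unna", "una", "u", "a")
VOWELS = set("aiuāīū")  # short and long vowels, treated as non-root material


def residual_root(word: str) -> str:
    """Strip one known prefix and one known suffix, then drop the vowels.

    Whatever consonants survive are taken as the candidate root.
    """
    stem = word
    # Try longer affixes first; keep at least three characters in the stem.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) > len(prefix) + 2:
            stem = stem[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) > len(suffix) + 2:
            stem = stem[:-len(suffix)]
            break
    return "".join(ch for ch in stem if ch not in VOWELS)


if __name__ == "__main__":
    for form in ("yaktubu", "taktubu", "kataba", "kātib"):
        print(form, "->", residual_root(form))
```

On these four forms of k-t-b the leftover consonants are the same each time, which is the whole point. Real data would break these toy rules quickly, which is exactly where a trained model would have to take over.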