by eindiran
2398 days ago
All languages with synthetic morphology (both agglutinative languages, which glue chains of morphemes together, and fusional languages, which pack several grammatical features into a single inflection) are hard for the language modelling techniques used for English. A big issue is that in synthetic languages any given 'word' is much rarer, because each stem can combine with many more morphemes to form a word. So if you're building something like a bag-of-words or an n-gram model, your input data is likely to be very sparse, which translates to poor modelling of the language itself, i.e. of which words speakers would judge as grammatical. With agglutinative languages like Turkish, a technique that has been used with considerable success is just treating each morpheme as a distinct token, but it has many of the same problems as word-level tokenization. I was looking at a paper recently that claimed to have found a good way to do smoothing so that unseen n-grams could be assigned a non-zero probability in a way that conformed to the rules of the language, but we'll have to see if that can work in practice.
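A minimal Python sketch of the sparsity point above, under stated assumptions: the six Turkish forms and their hand segmentation are a toy corpus made up for illustration, the function names are hypothetical, and add-one (Laplace) smoothing is only a simple stand-in, not the method from the paper mentioned above.

```python
from collections import Counter

# Toy, hand-segmented Turkish forms (an assumption for illustration only).
# ev = house, okul = school, -ler/-lar = plural, -imiz/-ımız = our, -de/-da = in/at
SEGMENTED = [
    ["ev", "ler", "imiz", "de"],    # evlerimizde
    ["ev", "ler", "de"],            # evlerde
    ["ev", "imiz", "de"],           # evimizde
    ["okul", "lar", "da"],          # okullarda
    ["okul", "lar", "ımız", "da"],  # okullarımızda
    ["okul", "da"],                 # okulda
]


def word_tokens(corpus):
    """Word-level tokenization: one token per full word form."""
    return ["".join(morphemes) for morphemes in corpus]


def morpheme_tokens(corpus):
    """Morpheme-level tokenization: one token per morpheme."""
    return [m for morphemes in corpus for m in morphemes]


def singleton_share(tokens):
    """Fraction of distinct types that occur exactly once (a crude sparsity measure)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def add_one_bigram_prob(corpus, prev, cur):
    """Add-one smoothed P(cur | prev) over within-word morpheme bigrams."""
    vocab = {m for morphemes in corpus for m in morphemes}
    bigrams = Counter(
        pair for morphemes in corpus for pair in zip(morphemes, morphemes[1:])
    )
    prev_count = sum(c for (p, _), c in bigrams.items() if p == prev)
    return (bigrams[(prev, cur)] + 1) / (prev_count + len(vocab))


print(singleton_share(word_tokens(SEGMENTED)))      # 1.0
print(singleton_share(morpheme_tokens(SEGMENTED)))  # 0.125
print(add_one_bigram_prob(SEGMENTED, "ev", "de"))   # unseen, grammatical (evde)
print(add_one_bigram_prob(SEGMENTED, "ev", "lar"))  # unseen, breaks vowel harmony
```

In this toy corpus every word form is seen exactly once, while only one of the eight morpheme types is. The last two lines print the same value: add-one smoothing gives the grammatical "evde" and the ungrammatical "ev" + "lar" identical probability, which is the kind of gap a smoothing scheme that respects the rules of the language would have to close.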