by eindiran 2398 days ago
All languages with synthetic morphology (both agglutinative languages, which glue chains of morphemes together, and fusional languages, which fuse several grammatical features into a single affix) are a poor fit for the language modelling techniques used for English.

A big issue is that in synthetic languages any individual word form is much rarer, because there are many more possible morpheme combinations per word. So if you're building something like a bag-of-words or an n-gram model, your input data is likely to be very sparse, which translates to poor modelling of the language itself and of which words speakers would judge as grammatical.
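
To make the sparsity point concrete, here is a rough sketch on a made-up toy corpus. The Turkish word forms and their hand segmentation are assumptions for illustration only; a real pipeline would need a morphological analyser, and the helper name `type_token_stats` is hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: Turkish word forms, hand-segmented with "-"
# purely for illustration (not the output of any real analyser).
SEGMENTED = [
    "ev-ler-imiz-de", "ev-ler-de", "ev-imiz-de", "ev-de",
    "göz-ler-imiz-de", "göz-ler-de", "göz-de", "ev-ler",
]

def type_token_stats(tokens):
    """Return (distinct types, total tokens, types seen exactly once)."""
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), sum(counts.values()), singletons

# Word-level view: strip the segmentation marks to get surface forms.
words = [w.replace("-", "") for w in SEGMENTED]
# Morpheme-level view: every segment is its own token.
morphemes = [m for w in SEGMENTED for m in w.split("-")]

word_stats = type_token_stats(words)
morph_stats = type_token_stats(morphemes)

print("words:    ", word_stats)
print("morphemes:", morph_stats)
```

In this toy data every word form occurs exactly once, while the handful of morphemes underneath them each recur several times, which is the sparsity gap in miniature.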

With agglutinative languages like Turkish, a technique that has been used with considerable success is simply treating each morpheme as a distinct token, though it still has many of the same problems as word-level tokenization. I was looking at a paper recently that claimed to have found a good way to do smoothing so that unseen n-grams are assigned a non-zero probability in a way that conforms to the rules of the language, but we'll have to see whether that works in practice.
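
For reference, a minimal sketch of morpheme-level bigrams with plain add-k smoothing. This is the naive textbook baseline, not the method from the paper mentioned above; the training words, their segmentation, the boundary markers and the value of `k` are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical hand-segmented training words (illustration only).
TRAIN = ["ev-ler-imiz-de", "ev-ler-de", "ev-de", "göz-ler-de", "göz-de"]
BOS, EOS = "<w>", "</w>"  # assumed word-boundary markers

def morpheme_bigrams(word):
    """Split a segmented word into (previous, next) morpheme pairs."""
    units = [BOS] + word.split("-") + [EOS]
    return list(zip(units, units[1:]))

bigram_counts = Counter(bg for w in TRAIN for bg in morpheme_bigrams(w))
context_counts = Counter(prev for w in TRAIN for prev, _ in morpheme_bigrams(w))
# Everything that was ever observed in the "next" position.
vocab = {nxt for _, nxt in bigram_counts}

def add_k_prob(prev, nxt, k=0.5):
    """P(nxt | prev) with add-k smoothing over the morpheme vocabulary."""
    return (bigram_counts[(prev, nxt)] + k) / (context_counts[prev] + k * len(vocab))

seen = add_k_prob("ler", "de")      # bigram that occurs in TRAIN
unseen = add_k_prob("göz", "imiz")  # plausible, but never observed
print(seen, unseen)
```

Note that add-k hands the same non-zero mass to every unseen pair, including ones that break Turkish suffix ordering such as ("de", "ler"). That is exactly the gap a smoothing scheme that respects the rules of the language would have to close.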