by jph00
2309 days ago
Yes, you're right: in our NLP course we used Turkish as our example. But for the book I mentioned Polish because of this paper: https://arxiv.org/abs/1810.10222 . But as you say, that means the word "agglutinative" isn't technically correct. I'm actually not sure what the right word is to describe languages that have lots of big compounds with no spaces (which is the key issue here, and the reason we need subword tokenization techniques).
https://en.wikipedia.org/wiki/Synthetic_language
There's a spectrum between synthetic and analytic languages ( https://en.wikipedia.org/wiki/Synthetic_language#Synthetic_a... ) and those closer to the synthetic end are the ones giving you trouble.
Polish is a subtype of synthetic called fusional (inflected), which means morphemes need to be adjusted to fit together. Agglutinative languages are those that mainly use agglutination, where morphemes are stuck together as is:
https://en.wikipedia.org/wiki/Agglutinative_language
Since it's a spectrum / categorization based on features, all languages will show these features to various degrees. E.g. the famous "anti|dis|establish|ment|ari|an|ism" in English and "anty|samo|u|bez|przedmiot|owia|nie" as a similar example in Polish (both from https://pl.wikipedia.org/wiki/Aglutynacyjno%C5%9B%C4%87 ), or the more humble "houseboat" or "bitwise".
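To make the "stuck together as is" point concrete, here's a rough sketch of why such words are easy to split once you have a vocabulary of pieces. This is only a toy greedy longest-match segmenter, not BPE and not whatever the linked paper uses. The `segment` function and the hand-written `vocab` are my own illustrative assumptions; the pieces themselves come from the English examples above.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation (toy sketch, not BPE).

    At each position take the longest vocabulary entry that matches;
    if nothing matches, fall back to a single character.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical hand-built vocabulary, taken from the examples in this comment.
vocab = set("anti|dis|establish|ment|ari|an|ism".split("|"))
vocab |= {"house", "boat", "bit", "wise"}

print("|".join(segment("antidisestablishmentarianism", vocab)))
print("|".join(segment("houseboat", vocab)))
print("|".join(segment("bitwise", vocab)))
```

This works on these examples because the pieces appear unchanged inside the word. In a fusional language the pieces often change form when combined, so I'd expect a fixed lookup like this to miss more often, which is part of why learned subword vocabularies are attractive.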
There are also polysynthetic languages, which is the name for the extreme end of this spectrum, but there are no familiar examples of these (Mayan languages, Ainu, Inuit and Aleut are the only ones I recognize from those mentioned on Wikipedia).