by mci
2304 days ago
The term you are looking for may be "highly inflected". Side note: IMHO, you are exaggerating the ability of Polish to form long compounds. Dissecting the "Bezbarwne zielone idee wściekle śpią" example from https://arxiv.org/pdf/1810.10222.pdf#page=3 reveals no words longer than 4 morphemes: bez-BARW-n-e ZIEL-on-e IDE-e WŚCIEK-l-e ŚP-ią, where I put word roots in uppercase and bound morphemes in lowercase. The longest sequences of morphemes (for a loose definition of morpheme) I can think of are conditional-mood forms of verbs with double prefixes, like po-wy-CHODZI-ł-y-by-ście. However, the sequences of bound morphemes in those forms, which may look complex to you, form a finite-state language that admits just a few sequences.
|
Your "powychodziłybyście" example could be translated as "you (feminine, plural) would have been going out". With word-level tokenization you get (ignoring the comma and brackets) 8 tokens in English and one token in Polish. Now you can have three persons, two genders, two numbers, an imperfective or perfective verb, etc., resulting in combinatorial growth in the number of word tokens in Polish. If you have all word forms for "go out" and you want to add "go in", in English you add a single token "in", while in Polish you add all the tokens again with "-wy-" replaced by "-w-". As a result, in Polish you end up with a much bigger vocabulary. Additionally, you need a bigger training corpus, as you cannot learn the tokens independently. For example, if you know the meaning of "he ate" and "she wrote", you should be able to guess the meaning of "he wrote", as you've seen all of the tokens. In Polish it's "Zjadł", "Napisała" and "Napisał": all of the word tokens are different.
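To make the vocabulary-growth argument concrete, here is a rough Python sketch. The slot inventories are a toy subset I picked for illustration (singular conditional forms of "wychodzić"/"wchodzić"), not a full Polish paradigm, and the function names are made up. It just counts how many entries a word-level vocabulary needs versus a morpheme-level one, and what adding one more prefix costs in each case.

```python
from itertools import product

# Toy morpheme slots (an illustrative assumption, not a complete paradigm).
# An empty string means the slot is left unfilled in that form.
SLOTS = [
    ["wy", "w"],      # directional prefix: "out" vs "in"
    ["chodzi"],       # verb root
    ["ł"],            # past-stem marker
    ["", "a"],        # masculine (unmarked) vs feminine, singular
    ["by"],           # conditional marker
    ["m", "ś", ""],   # 1st, 2nd, 3rd person singular
]

def word_vocab(slots):
    """Every full word form is its own vocabulary entry."""
    return {"".join(parts) for parts in product(*slots)}

def subword_vocab(slots):
    """Each non-empty morpheme is stored once and reused across forms."""
    return {m for slot in slots for m in slot if m}

def with_extra_prefix(slots, prefix):
    """Return a copy of the slots with one more directional prefix."""
    return [slots[0] + [prefix]] + slots[1:]

words, subwords = word_vocab(SLOTS), subword_vocab(SLOTS)
bigger = with_extra_prefix(SLOTS, "przy")

print("word-level entries:   ", len(words), "->", len(word_vocab(bigger)))
print("subword-level entries:", len(subwords), "->", len(subword_vocab(bigger)))
```

With these toy slots, adding a single prefix costs the word-level vocabulary one new entry per existing gender/person combination, while the subword vocabulary grows by exactly one entry. Real paradigms have more slots (number, aspect, a second prefix), so the gap widens multiplicatively.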
Using subword tokenization instead of word-level tokenization is somewhat like using a normalized database instead of a denormalized one. It's not about one form being more complex than the other, as they're equivalent. After all, would written English be much more complex if we removed all the whitespace? :)