| HN Mirror

It's not about the number of letters in the compounds, but about the number of morphemes.

Your "powychodziłybyście" example could be translated as "you (feminine, plural) would have been going out". With the word tokenization, you get (ignoring comma and brackets) 8 tokens in English and one token in Polish. Now you can have three persons, two genders, two numbers, an imperfective or perfective verb, etc. resulting in combinatorial growth of word tokens in Polish. If you have all word forms for "go out" and you want to add "go in", in English you would add a single token "in", and in Polish you add all the tokens with "-wy-" replaced by "-w-". As a result in Polish you end up with much bigger vocabulary. Additionally you need bigger training corpus as you cannot learn the tokens independently. For example, if you know the meaning of "he ate" and "she wrote", you should be able to guess the meaning of "he wrote", as you've seen all of the tokens. In Polish it's "Zjadł", "Napisała" and "Napisał" - all of the word tokens are different.

Using the subword tokenization instead of word-level tokenization is kind of similar to using a normalized database instead of unnormalized one. It's not about one form being more complex than the other as they're equivalent. After all, will written English be much more complex if we remove all whitespaces? :)