Hacker News
by mci 2304 days ago
> There are also "agglutinative languages", like Polish, which can add many morphemes together to create very long "words" which include a lot of separate pieces of information. [1]

Polish does not work this way. Source: I am Polish. Perhaps jph00 meant Turkish. Issue filed.

[1] https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb

3 comments

Yes, you're right: in our NLP course we used Turkish as our example.

But for the book I mentioned Polish due to this paper: https://arxiv.org/abs/1810.10222 . As you say, though, the word "agglutinative" isn't technically correct. I'm actually not sure what the right word is to describe languages that have lots of big compounds with no spaces (which is the key issue here, as to why we need subword tokenization techniques).

After reading that section of the book, I think the language property you're after is 'highly synthetic':

https://en.wikipedia.org/wiki/Synthetic_language

There's a spectrum between synthetic and analytic languages ( https://en.wikipedia.org/wiki/Synthetic_language#Synthetic_a... ) and those closer to the synthetic end are the ones giving you trouble.

Polish belongs to a subtype of synthetic languages called fusional (or inflected), which means morphemes need to be adjusted to fit together. Agglutinative languages are those that mainly use agglutination, where morphemes are stuck together as is:

https://en.wikipedia.org/wiki/Agglutinative_language

Since it's a spectrum / categorization based on features, all languages will show these features to various degrees. E.g. the famous "anti|dis|establish|ment|ari|an|ism" in English and "anty|samo|u|bez|przedmiot|owia|nie" as a similar example in Polish (both from https://pl.wikipedia.org/wiki/Aglutynacyjno%C5%9B%C4%87 ), or the more humble "houseboat" or "bitwise".

There are also polysynthetic languages, which is the name for the extreme of this spectrum, but there are no familiar examples of these (Mayan languages, Ainu, Inuit, and Aleut are the only ones I recognize from those mentioned on Wikipedia).

Many thanks - this is really helpful.

The term you are looking for may be "highly inflected".

Side note: IMHO, you are exaggerating the ability of Polish to form long compounds. Dissecting the "Bezbarwne zielone idee wściekle śpią" example from https://arxiv.org/pdf/1810.10222.pdf#page=3 reveals no words longer than 4 morphemes:

bez-BARW-n-e ZIEL-on-e IDE-e WŚCIEK-l-e ŚP-ią, where I put word roots in uppercase and bound morphemes in lowercase.
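
The count above can be checked mechanically. This is a minimal sketch that takes the hyphenated segmentation exactly as written in this comment and counts the pieces per word; the helper name `morpheme_counts` is made up for illustration, and the segmentation itself is the commenter's, not the output of any morphological analyzer.

```python
# Segmentation copied from the comment: roots in uppercase, bound morphemes in lowercase.
SEGMENTED = "bez-BARW-n-e ZIEL-on-e IDE-e WŚCIEK-l-e ŚP-ią"


def morpheme_counts(segmented_sentence):
    """Map each word (hyphens removed) to its number of hyphen-separated morphemes."""
    return {
        word.replace("-", ""): len(word.split("-"))
        for word in segmented_sentence.split()
    }


counts = morpheme_counts(SEGMENTED)
longest = max(counts.values())

print(counts)
print(longest)  # 4
```

Only "bez-BARW-n-e" reaches 4 morphemes; the other words have 2 or 3.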

The longest sequences of morphemes (for a loose definition of morpheme) I can think of are conditional-mood verb forms with double prefixes, like po-wy-CHODZI-ł-y-by-ście. However, the sequences of bound morphemes in those forms, which may look complex to you, form a finite-state language that admits just a few sequences.
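
To make the "finite-state" point concrete, here is a toy sketch: a regular expression with one slot per morpheme position, built only around the single stem from the example. The slot names and the lists of endings are assumptions for illustration; the pattern deliberately ignores agreement between slots and sound changes, so it overgenerates and is not a real Polish analyzer.

```python
import re

# Hypothetical slot grammar for one verb stem. Every slot is a short, fixed
# list of alternatives, so the whole pattern accepts only a finite set of strings.
CONDITIONAL = re.compile(
    r"(?P<prefix1>po)?"
    r"(?P<prefix2>wy)?"
    r"(?P<stem>chodzi)"
    r"(?P<past>ł)"
    r"(?P<gender_number>a|o|y)?"
    r"(?P<conditional>by)"
    r"(?P<person>m|ś|śmy|ście)?"
)


def segment(word):
    """Return the list of filled slots if the word matches the toy grammar, else None."""
    match = CONDITIONAL.fullmatch(word)
    if match is None:
        return None
    return [piece for piece in match.groups() if piece]


print(segment("powychodziłybyście"))
# ['po', 'wy', 'chodzi', 'ł', 'y', 'by', 'ście']
print(segment("houseboat"))  # None
```

The example word fills all seven slots, and no string accepted by this pattern can have more pieces than there are slots.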

It's not about the number of letters in the compounds, but about the number of morphemes.

Your "powychodziłybyście" example could be translated as "you (feminine, plural) would have been going out". With word tokenization, you get (ignoring the comma and brackets) 8 tokens in English and one token in Polish. Now you can have three persons, two genders, two numbers, an imperfective or perfective verb, etc., resulting in combinatorial growth of word tokens in Polish. If you have all word forms for "go out" and you want to add "go in", in English you would add a single token "in", and in Polish you add all the tokens with "-wy-" replaced by "-w-". As a result, in Polish you end up with a much bigger vocabulary. Additionally, you need a bigger training corpus, as you cannot learn the tokens independently. For example, if you know the meaning of "he ate" and "she wrote", you should be able to guess the meaning of "he wrote", as you've seen all of the tokens. In Polish it's "Zjadł", "Napisała" and "Napisał" - all of the word tokens are different.

Using subword tokenization instead of word-level tokenization is somewhat like using a normalized database instead of an unnormalized one. It's not about one form being more complex than the other, as they're equivalent. After all, would written English be much more complex if we removed all whitespace? :)
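
The "he ate" / "she wrote" argument can be sketched in a few lines. The function name `unseen_tokens` is hypothetical, and the subword split of the Polish forms is a hand-made segmentation assumed for illustration, not the output of a trained subword tokenizer.

```python
def unseen_tokens(train_sentences, test_sentence):
    """Return the tokens of test_sentence that never occur in train_sentences."""
    vocab = {tok for sentence in train_sentences for tok in sentence.split()}
    return [tok for tok in test_sentence.split() if tok not in vocab]


# Word-level tokens, using the examples from the comment above.
english_gap = unseen_tokens(["he ate", "she wrote"], "he wrote")
polish_gap = unseen_tokens(["Zjadł", "Napisała"], "Napisał")

# Assumption: a hand-made split into smaller pieces, written with spaces
# so the same whitespace tokenizer can be reused.
polish_subword_gap = unseen_tokens(["Z jad ł", "Na pisa ł a"], "Na pisa ł")

print(english_gap)         # []
print(polish_gap)          # ['Napisał']
print(polish_subword_gap)  # []
```

At word level the Polish test form is entirely new, while with the smaller pieces every token has already been seen, which is the situation English is in from the start.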

I agree with what you wrote. I did not object to the subword tokenization that let you(?) win the competition. I objected to GP's assertion that one can add many morphemes together to create very long "words" in Polish, which made casual readers think of stringing morphemes together like German compounds, while the number of morphemes in Polish words is bounded by 7, maybe by 8.

Both the primary authors are Polish, and they won the competition, so I don't really have any informed view to add...

Maybe best to mention Turkish in the book!

Or Hungarian.

Megszentségteleníthetetlenségeskedéseitekért for example.

Yes, which is an agglutinative language.

Doesn't German work this way?

Not really. "German grammar allows for the construction of long compounded noun phrases which are expressed as one word in written language. Compounding is not really the same as agglutination." Source: https://www.quora.com/Is-German-considered-a-true-agglutinat...

There are quite a lot of languages that do though: https://en.wikipedia.org/wiki/Agglutinative_language

Not just in written language, although the difference between a “word” and “noun phrase” in spoken language is in the ear of the beholder.

But in a linguistic sense indeed, German is not at all an agglutinative language.

English too. Policeman, bathwater, catwalk, headstone, toothbrush, etc.