Hacker News
by sansnomme 2402 days ago
Turkish is probably regular enough to be used as a programming language. The only downside is that its vocabulary is utterly alien to most speakers of Romance and Germanic languages, aside from some words borrowed from French and Arabic.

It's actually quite a bit easier to learn since it has few false friends with Romance languages. I've often wondered, though, whether search engines built by English speakers around bags of words can work very well for Turkish.
All languages with synthetic morphology (both agglutinative languages, which glue chains of morphemes together, and fusional languages, which fuse several grammatical features into a single inflection) struggle with the language modelling techniques used for English.

A big issue is that in synthetic languages any individual word form is much rarer, because each root can combine with many different morpheme sequences. So if you're building something like a bag-of-words or an n-gram model, your input data is likely to be very sparse, which translates to poor modelling of the language itself and of which words speakers would judge as grammatical.
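To make the sparsity point concrete, here is a minimal sketch in Python. The toy corpus and its hyphenated segmentation are my own illustrative assumptions (hand-segmented forms of "ev", house), not real data or the output of any analyzer. It compares how often each surface word is seen with how often each morpheme is seen.

```python
from collections import Counter

# Hand-segmented toy corpus; hyphens mark assumed morpheme boundaries.
corpus = [
    "ev-ler-im-de",  # in my houses
    "ev-ler-de",     # in the houses
    "ev-im-de",      # in my house
    "ev-de",         # in the house
    "ev-ler-im",     # my houses
    "ev-im",         # my house
]

# Word-level view: every surface form is its own vocabulary entry.
word_counts = Counter(w.replace("-", "") for w in corpus)

# Morpheme-level view: the same text, split at the boundaries.
morpheme_counts = Counter(m for w in corpus for m in w.split("-"))


def singleton_share(counts):
    """Fraction of vocabulary entries that were observed exactly once."""
    return sum(1 for c in counts.values() if c == 1) / len(counts)


print(len(word_counts), singleton_share(word_counts))
print(len(morpheme_counts), singleton_share(morpheme_counts))
```

In this toy set every word form is seen only once, while every morpheme is seen several times, which is the gap a word-level model has to cope with.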

With agglutinative languages like Turkish, a technique that has been used with considerable success is just treating each morpheme as a distinct token, but it has many of the same problems as word-level tokenization. I was looking at a paper recently that claimed to have found a good way to do smoothing so that unseen n-grams could be assigned a non-zero probability in a way that conformed to the rules of the language, but we'll have to see if that can work in practice.
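For illustration, a rough sketch of a morpheme-level bigram model in Python. The smoothing here is plain add-k, which I'm using as a generic stand-in and not the method from the paper mentioned above. The `#` boundary marker, the function names and the hyphenated input format are all assumptions of the sketch.

```python
from collections import Counter

BOUNDARY = "#"  # hypothetical marker for the start and end of a word


def train(segmented_words):
    """Count morpheme bigrams from words written like 'ev-ler-de'."""
    contexts, bigrams = Counter(), Counter()
    for word in segmented_words:
        tokens = [BOUNDARY] + word.split("-") + [BOUNDARY]
        contexts.update(tokens[:-1])             # each token as a left context
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent morpheme pairs
    vocab = {t for pair in bigrams for t in pair}
    return contexts, bigrams, vocab


def prob(prev, nxt, model, k=0.5):
    """Add-k estimate of P(nxt | prev); unseen pairs get a small non-zero mass."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, nxt)] + k) / (contexts[prev] + k * len(vocab))


model = train(["ev-ler-de", "ev-im-de", "ev-im"])
print(prob("ler", "im", model))  # unseen but plausible sequence
print(prob("de", "ler", model))  # unseen and out of order, still non-zero
```

Add-k gives every unseen pair some probability, including morpheme orders no speaker would accept, so it does not by itself conform to the rules of the language. That is exactly the part a grammar-aware smoothing scheme would have to improve on.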

Not only search, but also autocorrect. Turkish autocorrect on iOS is a flaming disaster even after a decade.

Here’s a real (if unlikely) word in Turkish and how this whole agglutination business works: https://twitter.com/languagecrawler/status/62385880386859827...

I don’t blame Apple though - it might actually be just impossible to do Turkish autocorrect in the same way English autocorrect works, because the beginning of the word indicates the actual word but the end indicates everything else (direction, modifiers etc.). So it’s about as easy/hard as English to guess the beginning of the word, but impossible to guess the modifiers that get added, because the moment the modifier sequence starts, every single letter changes the meaning, so there are almost no incorrect paths. A correct Turkish autocorrect implementation would autocomplete the word root, but leave the half-complete word alone at the point where the modifier suffixes start, so that the user can complete the modifier sequence on their own.
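A minimal sketch of that idea in Python, assuming a tiny hypothetical root lexicon (`ROOTS`) and a made-up `suggest` function. A real keyboard would need a proper morphological analyzer and would have to handle vowel harmony and consonant changes, none of which is attempted here.

```python
# Hypothetical root lexicon: house, universe, school, book.
ROOTS = {"ev", "evren", "okul", "kitap"}


def suggest(typed, roots=ROOTS):
    """Offer root completions only while the user is still typing a root.

    Once the input runs past every known root, return nothing so the
    suffix sequence is left entirely to the user.
    """
    # Still inside (or exactly at) one or more roots: offer those roots.
    completions = sorted(r for r in roots if r.startswith(typed))
    if completions:
        return completions
    # Past a known root (suffix territory) or unknown input: stay out of the way.
    return []


print(suggest("ki"))       # still typing a root
print(suggest("evlerim"))  # already into the suffixes
```

The design choice is simply to stop suggesting at the root boundary: the software only acts where guessing is about as hard as in English, and does nothing once every additional letter changes the meaning.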

Seems like you're talking about autocompletion, not autocorrect. In autocorrect you have completed the word, hit space and then the software fixes your typos. In autocomplete you get a list of suggested words while typing and you can tap them if your intended word is shown.
No - while autocompletion is also broken, I’m talking about autocorrect. It’s a very common occurrence on Turkish iOS that something you typed in correctly gets autocorrected to something else that makes no sense.