Hacker News
by stavros 1479 days ago
I don't think that's how language detection works; they most likely use n-gram frequencies to estimate the probability of each language. It's still detected as Greek if you change it to "Apoulon vesrreaitais", just because it kind of looks the way Greek words look, not because it resembles any specific word.
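
To make the n-gram idea concrete, here is a minimal sketch of a character-trigram detector. Everything in it is an assumption for illustration: the two tiny word lists, the `greek`/`english` labels, and the add-one smoothing are made up, and this is not a claim about what Google Translate actually does.

```python
from collections import Counter
import math

def ngrams(text, n=3):
    # Pad with spaces so word boundaries contribute n-grams too.
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    # Count every character n-gram seen in the sample words.
    counts = Counter()
    for sample in samples:
        counts.update(ngrams(sample, n))
    return counts

def log_prob(text, counts, vocab_size, n=3):
    # Log-probability of the text under one language, with add-one smoothing
    # so unseen n-grams get a small non-zero probability.
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size))
        for g in ngrams(text, n)
    )

def detect(text, profiles, n=3):
    # Pick the language whose n-gram profile makes the text most likely.
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    return max(
        profiles,
        key=lambda lang: log_prob(text, profiles[lang], len(vocab), n),
    )

# Hypothetical toy corpora: romanized Greek words vs. English words.
profiles = {
    "greek": train(["kalimera", "efharisto", "parakalo", "thalassa", "elefantas"]),
    "english": train(["morning", "thanks", "please", "ocean", "elephant"]),
}

# "parimera" is a made-up string, not a word in either list.
print(detect("parimera", profiles))  # greek
```

The made-up string is labelled Greek only because most of its trigrams occur in the Greek sample, which is the "looks the way Greek words look" effect described above.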

You are wrong. Had it been that simple I would __not__ have suggested that and for whatever reason I find your reply borderline infuriating but I can't pinpoint exactly why that is.

Regardless, here is me, a native speaker, disproving your hypothesis.

I tried the following words in Google Translate: elefantas, ailaifantas, ailaiphantas, elaiphandas, elaiphandac.

The suggested detections are ελέφαντας, αιλαιφάντας, αιλαιφάντας, ελαϊφάντας, ελαϊφάντας; however, the translations are elephant, illuminated, illuminated, elephant, elephant respectively. The first is correct. When mapping the Roman characters back to Greek, there is loss of information: this is seen in the diaeresis above the iota, which changes the pronunciation from ε-like [e] to αϊ [ai̯], and in the stress mark above the epsilon (έ).

Notice that all the words have an edit distance of >=4, a soundex distance of at most 1, and a metaphone distance of at most 1 [1]. The suggested words, as I said above, are near homophones of the correct word, bar a few minor details.

[1] http://www.ripelacunae.net/projects/levenshtein
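
For anyone who wants to check the distances locally, here is a rough sketch of Levenshtein distance and American Soundex in plain Python. It is written from the textbook definitions, not taken from the tool linked in [1], so its numbers may differ slightly from what that page reports.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def soundex(word):
    # American Soundex: keep the first letter, then up to three digits.
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]

correct = "elefantas"
for variant in ["ailaifantas", "ailaiphantas", "elaiphandas", "elaiphandac"]:
    print(variant, levenshtein(correct, variant), soundex(variant))
```

With this sketch, "ailaifantas" is 4 edits away from "elefantas", while the Soundex codes come out as E415 and A415, differing only in the leading letter.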

> for whatever reason I find your reply borderline infuriating but I can't pinpoint exactly why that is.

I guess that says more about you than about my reply. Also, I'm a native speaker as well. That doesn't really have any bearing; my comment above comes from what I know about common implementations of language detection algorithms, not so much from looking at how Google Translate behaves.

And I was honest about how I felt given how you structured it.

It does have a lot of bearing, actually. While I am a native speaker, my spelling skills are atrocious, as everything is a sequence of sounds in my head more so than a sequence of letters. To get around my spelling issues, I frequently type homophones to find the correct spelling of a word. That lookup uses soundex or similar algorithms, along with character mappings between the two languages, to find the correct word.

Regardless, I believe I have proved the hypothesis to not be true.