by rhesa 1477 days ago
My litmus test for translation services is to see if the ambiguity between "turkey" (the bird) and "Turkey" (the country) manifests itself. Many European languages have two separate words for the two concepts that can't possibly be confused, so I think this is a good unit test.

My Dutch test sentence is: "de kalkoen bezoekt Turkije" (the bird visits the country). Most of them get it wrong for most European languages:

google -> german: "die türkei besucht die türkei"

bing -> swedish: "kalkonen besöker kalkon"

deepl -> hungarian: "a pulyka meglátogatja a pulykát"

translate.com -> french: "la dinde visite la dinde"

lingvanex.com -> spanish: "el pavo visita pavo"

Only one gets it right:

systran.net -> french: "La dinde visite la Turquie"
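For anyone who wants to repeat this check without eyeballing each result, here is a rough sketch of the pass/fail rule. The `EXPECTED` table and the `passes_turkey_test` helper are my own hypothetical names, and the country words for Swedish, Hungarian and Spanish are filled in by me rather than taken from any service's output. It only does a substring match, so treat it as a quick filter, not a real evaluation.

```python
import unicodedata

# Hypothetical lookup table: target language -> (word for the bird, word for the country).
# The country names for Swedish, Hungarian and Spanish are assumptions added for illustration.
EXPECTED = {
    "german": ("truthahn", "türkei"),
    "french": ("dinde", "turquie"),
    "swedish": ("kalkon", "turkiet"),
    "hungarian": ("pulyka", "törökország"),
    "spanish": ("pavo", "turquía"),
}


def _normalize(text):
    # Compose accents consistently and ignore case before comparing.
    return unicodedata.normalize("NFC", text).casefold()


def passes_turkey_test(language, translation):
    """Return True if the translation contains both the bird and the country word."""
    bird, country = EXPECTED[language]
    text = _normalize(translation)
    return _normalize(bird) in text and _normalize(country) in text


# The outputs quoted above, keyed by (service, target language).
results = {
    ("google", "german"): "die türkei besucht die türkei",
    ("bing", "swedish"): "kalkonen besöker kalkon",
    ("deepl", "hungarian"): "a pulyka meglátogatja a pulykát",
    ("translate.com", "french"): "la dinde visite la dinde",
    ("lingvanex.com", "spanish"): "el pavo visita pavo",
    ("systran.net", "french"): "La dinde visite la Turquie",
}

passed = [service for (service, lang), out in results.items()
          if passes_turkey_test(lang, out)]
```

With the outputs listed above, only the systran.net entry ends up in `passed`.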

5 comments

Really interesting.

I tried out "Hey, how are you? I don't understand how it can be so warm today." in my native language, and systran is the only one that got it 100% correct. Google was close, but reversed an article and a noun. The others mixed up "don't understand" with "don't know", which are similar but different enough to sound unnatural.

I've always thought that the best way to assess these systems is how they handle colloquial speech, stuff that we often take for granted but that's really quite strange to translate "literally". I bet even that phrase -- "take for granted" -- would be difficult to translate even though I'm certain most languages have a phrase for that exact sentiment.

I've put this exact comment and asked for a Korean translation:

> 정말 흥미 롭습니다.

Almost correct, but "흥미롭습니다" (lit. "is interesting", with an implied "to me") should be a single word.

> "이봐, 잘 지내? 오늘 어떻게 따뜻해질 수 있을지 모르겠습니다. "모국어로 systran이 100 %을 얻은 유일한 언어입니다. 구글은 가까웠지만 기사와 명사를 뒤집었다. 다른 사람들은 "알지 못한다"와 "알지 못한다"를 섞었다.

"I tried out" is completely missing, and "in my native language" is joined to the next sentence "and systran is...".

The quoted sentence is, when translated back to English, something like "Hey, are you going well? I don't know how [something] can get warm today." Like most other machine translators LingvaNex is clueless about Korean honorifics (the first sentence is informal while the second is formal here). It does get a colloquial Korean expression for "I don't understand" (lit. 이해가 안 된다) but doesn't get the dummy pronoun, so it somehow assumes an unspecified entity as a subject. The position of the closing quote is also off.

The first quasi-sentence after that quotation became something like "[it] is the only language that systran got 100 % in a native tongue". Even after ignoring the incorrectly joined "in a native tongue", the dummy pronoun seems to be the culprit again here, since "the only one" got interpreted as a language, not systran.

The next sentence reads like "Google was nearby but swapped a post with a noun." The Korean word 가깝다 (lit. nearby) has a slightly different nuance from English "close", so it has to be paraphrased. LingvaNex interpreted "an article" as, uh, a newspaper article, which is not a synonym in Korean (the correct word should be 관사). And it somehow also switched back to informal expressions.

The final sentence reads like "other people mixed 'don't know' and 'don't know'." This is kinda hilarious; LingvaNex actually understands both expressions are more or less equivalent in Korean but doesn't know when they have to be distinct.

> 나는 항상 이러한 시스템을 평가하는 가장 좋은 방법은 구어체 연설을 처리하는 방법, 우리가 종종 당연하게 여기는 것이지만 "문자 그대로"번역하는 것은 정말 이상하다고 생각했습니다". 나는 대부분의 언어가 그 정확한 감정에 대한 문구를 가지고 있다고 확신하지만 그 단어 ( "당연한 것으로 받아들이십시오")조차도 번역하기가 어려울 것입니다.

The first half is so hopelessly mangled that I can't give an English equivalent. I mean, each part is reasonably translated (including the phrase "stuff that we often take for granted", whose translation is pretty much correct) but the wrong ordering messes everything up.

The second half is more reasonable: "I'm confident that most languages have phrases for the exact feeling but even that word ('take it to be natural') would be hard to translate." What "the exact feeling" refers to is unclear due to the reordering, and "take for granted" got translated too literally, but otherwise it sounds fine.

I remember reading years ago about one of the more amusing machine translation failures. The software translated "The spirit is willing, but the flesh is weak." into Russian, then translated the result from Russian back into English, and the result was "The vodka is good, but the meat is spoiled."

I tried the same test with Google Translate just now and the result produced was "The spirit wants, but the flesh is weak." Not bad.

My favorite example:

"Yesterday I went to the beach on Long Island. The sound was beautiful."

versus

"Yesterday I went to the symphony. The sound was beautiful."

The trick of maintaining the context of "sound" is difficult.

As a non-native English speaker and (allegedly) not an AI bot, even I parsed the 'sound' in the first example wrong at first.

Turkey recently changed the spelling of its name for this reason.

I get "der Truthahn besucht die Türkei" with google.

Including the correct capitalization of nouns? Impressive!

I did a side-by-side test with a coworker, looking at the results in various European languages. We did get the same answers, and some languages did get the correct translation (Ukrainian, Polish, Czech, but not Slovak), but it is weird that you get a different translation than I did.

Yes, that is the 1:1 copy. I don't know German capitalisation rules. Was on mobile though.

Here's an interesting thing that might be due to their ML or whatever:

Try:

- de kalkoen bezoekt Turkije

- De kalkoen bezoekt Turkije

- de kalkoen bezoekt Turkije.

- De kalkoen bezoekt turkije.

- De kalkoen bezoekt turkije[space] with an extra trailing space

I also get this with deepl
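
If you want to try these systematically, the variants above differ only in the capitalization of the first and last word and in what trails the sentence. A small sketch that enumerates all such combinations (the `input_variants` helper is a hypothetical name of mine; it only builds the input strings and does not call any translation service):

```python
from itertools import product


def input_variants(words, suffixes=("", ".", " ")):
    """Build every combination of first-word case, last-word case and trailing suffix."""
    first, *middle, last = words
    variants = []
    for cap_first, cap_last, suffix in product((False, True), (False, True), suffixes):
        head = first.capitalize() if cap_first else first.lower()
        tail = last.capitalize() if cap_last else last.lower()
        variants.append(" ".join([head, *middle, tail]) + suffix)
    return variants


variants = input_variants(["de", "kalkoen", "bezoekt", "Turkije"])
for v in variants:
    # repr() makes the trailing space visible when printed.
    print(repr(v))
```

This yields 12 inputs, which include the five listed above; paste each one into the translator and compare the results.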