by hprotagonist 3182 days ago
Certainly the relative paucity of some subject matter to train on is a factor.

The other factor is that complex use of natural language depends on and requires ambiguity, layered meaning, and a host of other things that are really hard to handle.


I'm not sure love letters are more complex than the news. It often turns out we're not as complex as we think. Or that doing simple calculations on very large amounts of data captures all those nuances. That was the result described in "The Unreasonable Effectiveness of Data".
There's (far) more information in-band in a snippet of language than there is in the "plain" "meaning" of that snippet. All of it is relevant for translation, and none of it is easily tractable.

A fairly trivial example is a pun. Translating highly idiomatic things of this nature turns out to be extraordinarily hard, and just throwing more data at a DNN is not going to get you too far down that road.

That depends on how often that pun appears in the corpus. If you observe the translation enough times, it'll be easy for the computer.
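To make the "observe it enough times" idea concrete, here is a minimal sketch of a frequency-based lookup. The toy sentence pairs and the `best_translation` helper are made up for illustration; this is not how any real translation system is claimed to work.

```python
from collections import Counter, defaultdict

# Hypothetical toy "parallel corpus": (English phrase, observed French rendering).
observed_pairs = [
    ("it's raining cats and dogs", "il pleut des cordes"),
    ("it's raining cats and dogs", "il pleut des cordes"),
    ("it's raining cats and dogs", "il pleut des chats et des chiens"),
]

# For each source phrase, count how often each rendering was seen.
table = defaultdict(Counter)
for source, target in observed_pairs:
    table[source][target] += 1

def best_translation(phrase):
    """Return the most frequently observed rendering, or None if never seen."""
    if phrase not in table:
        return None
    return table[phrase].most_common(1)[0][0]

print(best_translation("it's raining cats and dogs"))
print(best_translation("time flies like an arrow"))
```

The lookup only helps for phrases that already appear in the data; anything unseen comes back as `None`, which is where the two sides of this thread disagree.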
This misses the point. There do exist untranslatable idioms, and puns are an easy example. They rely on language-specific features like rhymes (or sight rhymes) or homophones that are not preserved across languages.

You can absolutely make a "direct" word-for-word translation of a pun in English to, say, Russian. It's just not a pun any more when you're done. Often there are no "pun equivalents" with totally different words, because usually they hinge on culturally specific references that also don't translate well.

Basically none of this matters when what you're interested in is subway directions or ordering food or whatever, but it becomes intractable really fast whenever you're interested in talking about something more meaningful.

Au contraire, it doesn't matter if the pun is translated directly. Heck, Google might be doing whole paragraph translation for all we know. It's certainly not at the level of individual words.

Translate words and it's gibberish. Pairs of words and you start to get slang. Triples and you can distinguish words that are different parts of speech in different contexts. Quads and grammar is mostly in the bag. 5-grams and most puns are handled. 6-grams and you've taken care of all simple sentences. Etc.

No need for semantics when n-gram counts do just as well.
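For anyone who hasn't played with n-grams, a rough sketch of what "n-gram counts" means. The example sentence and the `ngram_counts` helper are assumptions for illustration, not anything Google is known to run.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus where "flies" is a verb in one place and a noun in another.
corpus = "time flies like an arrow and fruit flies like a banana".split()

bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

# The bigram alone cannot tell the two uses of "flies" apart...
print(bigrams[("flies", "like")])
# ...but the trigrams carry one extra word of context each.
print(trigrams[("time", "flies", "like")])
print(trigrams[("fruit", "flies", "like")])
```

Longer n-grams capture more context, but each specific one shows up less often in any corpus, so the counts get sparse quickly.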

With enough people talking, we'll eventually have taught Google all the translations for all possible sentences. (joking, but only halfway)

>it doesn't matter if the pun is translated directly.

you can't do this anyway.

>No need for semantics when n-gram counts do just as well.

They don't. Neither do bag-of-words, word2vec, or whatever.

Simple imperative language? Absolutely, this all works pretty well. Anything else? Ha.

Love letters rely more on emotion, allegory, and subtext than the news, especially modern-era news.
Does that make translation more random? If not, then the real trouble is lack of data.