by YeGoblynQueenne 3289 days ago
My reading is that Pereira doesn't think that deep learning has quite conquered language, and in this he's in complete disagreement with both Goldberg's and LeCun's sides (both of whom champion deep learning for NLP and claim that it has led to great advances in the field).

For me the problem with NLP and deep learning, or indeed any empirical method, is that the evaluation metrics we have are imperfect. Take BLEU scores, from Goldberg's post, for instance. Those basically compare generated text to some arbitrary target. Originally, they were proposed as metrics of machine translation quality, so the target was some existing translation and the machine-generated translation was examined for coverage of this human-made translation. But of course, there is no principled way that we know of to choose one translation over another, or even to say whether a translation is a good or bad translation on its own. And that's true for translations by humans also. You give the same text to 10 professional translators, they'll give you 10 different translations. Then you give each of their translations to 10 readers and ask them for their opinion, and you get back 100 different opinions.
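To make the "compare to an arbitrary target" point concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not full BLEU (there is no brevity penalty and no geometric mean over n-gram orders), and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical example: one reference, two candidate translations.
reference = "the cat sat on the mat"
same = "the cat sat on the mat"
paraphrase = "a cat was sitting on the rug"

print(clipped_precision(same, reference, 1))        # identical wording
print(clipped_precision(paraphrase, reference, 1))  # arguably valid, scores low
print(clipped_precision(paraphrase, reference, 2))
```

The paraphrase is a perfectly reasonable rendering of the same scene, but it is penalised simply for not using the reference's words, which is the arbitrariness being described.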

The translation task itself is not even particularly well defined, exactly because there may be any number of valid translations (possibly, infinitely many) of a piece of text in another language. So, with translation, we have an ill-defined task with an arbitrary metric. And that metric of course is lifted from its original task and used to evaluate language generation and so on. Then someone comes along who knows how to train a deep net but has no idea what the purpose of their chosen metric is or what it does, has no understanding of the task itself, and claims to have solved it because they got good results on that metric.

It's a bit of a methodological mess that's not going to lead to much progress. People can keep piling on these "results" for as long as they like and pretend that they're "solving" this or that problem- but in real-world terms, nothing is really being solved at all.

3 comments

Bit of an aside: Apparently ChrF – character-level n-gram F-score – is the new hotness in evaluating MT systems http://www.aclweb.org/anthology/W/W15/W15-30.pdf#page=412
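For anyone who hasn't seen it, a rough sketch of the idea behind a character n-gram F-score follows. This is a simplification and not the reference implementation: stripping spaces, averaging over n = 1..6 and defaulting to beta = 1 are assumptions made here, and `chrf_sketch` is a made-up name.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (an assumption of this sketch)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf_sketch(candidate, reference, max_n=6, beta=1.0):
    """Average character n-gram precision and recall, combined into an F-score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one of the strings is too short for this n
        overlap = sum((cand & ref).values())  # clipped matches
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(chrf_sketch("the cat", "the mat"))
```

Like BLEU, it is still a string comparison against one chosen reference; it just works on characters rather than words.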
OK, but this still has the same problem as BLEU: it relies on comparisons to human scores, which are entirely subjective. I'm not saying they're not the best we've got, but it's a big problem for machine translation that the only way to evaluate results is, essentially, comparing them to eyeballing.
Google translate is now based on a neural network and you can be sure they have solid metrics. By analogy Google search has a large panel of humans whose subjective feedback is used to test the quality of search algorithm variations.
This is something that needs to be repeated until everyone internalises it: for language pairs other than the "easy" ones Google translate sucks.

I am Greek and translations from and to my language are utterly ridiculous, on the level of Bozo the clown doing the translation with his underpants on his head back to front.

Typical example: I put in the Greek word for "swallow", the bird, and ask for the French translation. I get back the word "avaler" - the French word for "to swallow", the verb.

That's my little benchmark there, useful because Google translate has been doing this consistently, for a good few years, before it used neural networks, before it started claiming its setup essentially constitutes an "interlingua" etc etc.

Note that the bird and the verb sound nothing like each other in Greek, or French. They sound the same only in English, so GT goes from Greek to French through English, because it doesn't have enough parallel texts to go directly to French. And so it sucks, because it doesn't have enough data. You can ask native speakers of other languages that are not English or that have fewer speakers, perhaps Turkish or Hungarian etc. I'm pretty sure you'll find out they have similar experiences.
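A toy illustration of why going through English would lose the distinction. The three dictionaries below are invented for the example and say nothing about how Google Translate is actually built; they only show how an ambiguous pivot word collapses two senses into one.

```python
# Hypothetical toy lexicons, for illustration only.
EL_TO_EN = {
    "χελιδόνι": "swallow",  # the bird
    "καταπίνω": "swallow",  # the verb
}
EN_TO_FR = {
    "swallow": "avaler",  # the pivot step only knows one sense
}
EL_TO_FR = {
    "χελιδόνι": "hirondelle",
    "καταπίνω": "avaler",
}


def translate_via_pivot(word):
    """Greek -> English -> French: the sense is lost at the English step."""
    return EN_TO_FR[EL_TO_EN[word]]


def translate_direct(word):
    """Greek -> French with a direct lexicon: the two senses stay distinct."""
    return EL_TO_FR[word]


print(translate_via_pivot("χελιδόνι"))
print(translate_direct("χελιδόνι"))
```

Both Greek words map to the same English string, so nothing downstream of the pivot can tell the bird from the verb without extra context.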

So I don't know what metric they use to evaluate their results, but it doesn't seem to be a particularly good metric of translation quality. Maybe they just care more about how many people use their system and try to optimise for that, rather than going for quality, which is much harder to measure.

I'm Polish. I use Google Translate to go to English, not to Polish, even from Slavic languages that are very close to Polish (Ukrainian, Slovak - they're like 50% understandable without translation), because X -> Polish Google translation sucks.
>> you can be sure they have solid metrics.

Btw- no, I can't be sure of that. Why do you say I can? Do you know what metrics they use?

But deep learning networks are being used in production every day, aren't they?
Yes