| HN Mirror

Hacker News

by paulsutter 3298 days ago

This relates to the big Twitter uproar over this blog post:

https://medium.com/@yoav.goldberg/an-adversarial-review-of-a...

And here's the meat of his response:

> Idea! Let's go back to toy problems where we can create the test conditions easily, like the rationalists did back then (even if we don't realize we are imitating them). After all, Atari is not real life, but it still demonstrates remarkable RL progress. Let's make the Ataris of natural language!

> But now the rationalists converted to empiricism (with the extra enthusiasm of the convert) complain bitterly. Not fair, Atari is not real life!

> Of course it is not. But neither is PTB, nor any of the standard empiricist tasks, which try strenuously to imitate wild language

2 comments

YeGoblynQueenne 3298 days ago

My reading is that Pereira doesn't think that deep learning has quite conquered language, and in this he's in complete disagreement with both Goldberg's and LeCun's sides (who both champion deep learning for NLP and claim that it has led to great advances in the field).

For me the problem with NLP and deep learning, or indeed any empirical method, is that the evaluation metrics we have are imperfect. Take BLEU scores, from Goldberg's post, for instance. Those basically compare generated text to some arbitrary target. Originally, they were proposed as metrics of machine translation quality, so the target was some existing translation and the machine-generated translation was examined for n-gram overlap with this human-made translation. But of course, there is no principled way that we know of to choose one translation over another, or even to say whether a translation is good or bad on its own. And that's true for translations by humans also. Give the same text to 10 professional translators and they'll give you 10 different translations. Then give each of their translations to 10 readers and ask for their opinion, and you get back 100 different opinions.
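To make the point about an arbitrary target concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not full BLEU: it assumes a single reference, whitespace tokenisation, and no brevity penalty, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against ONE reference.

    Assumption: whitespace tokenisation and a single reference; real BLEU
    combines several n-gram orders and adds a brevity penalty.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


reference = "the cat sat on the mat"
paraphrase = "a cat was sitting on the rug"  # a perfectly reasonable rendering

unigram_score = modified_precision(paraphrase, reference, 1)
bigram_score = modified_precision(paraphrase, reference, 2)
```

The paraphrase is penalised purely for not reusing the reference's wording, which is the "arbitrary target" problem: pick a different human translation as the reference and the same output gets a different score.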

The translation task itself is not even particularly well defined, exactly because there may be any number of valid translations (possibly, infinitely many) of a piece of text in another language. So, with translation, we have an ill-defined task with an arbitrary metric. And that metric of course is lifted from its original task and used to evaluate language generation and so on. Then someone comes along who knows how to train a deep net but has no idea what the purpose of their chosen metric is or what it does, and no understanding of the task itself, and claims to have solved it because they got good results on that metric.

It's a bit of a methodological mess that's not going to lead to much progress. People can keep piling on these "results" for as long as they like and pretend that they're "solving" this or that problem, but in real-world terms, nothing is really being solved at all.

unhammer 3298 days ago

Bit of an aside: Apparently ChrF – character-level n-gram F-score – is the new hotness in evaluating MT systems http://www.aclweb.org/anthology/W/W15/W15-30.pdf#page=412
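For anyone who hasn't seen it, a rough sketch of the idea behind a character n-gram F-score follows. This is a simplified illustration, not the reference implementation: the defaults (`max_n=6`, `beta=1.0`), dropping spaces, and plain averaging over n-gram orders are assumptions made here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(candidate, reference, max_n=6, beta=1.0):
    """Simplified character n-gram F-score between two strings."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Working on characters rather than words gives partial credit for inflected forms, e.g. `chrf("swallows", "swallow")` comes out much higher than `chrf("swallows", "sparrow")`, but the score is still computed against a human-written reference string.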

YeGoblynQueenne 3298 days ago

OK, but this still has the same problem as BLEU: it relies on comparisons to human scores, which are entirely subjective. I'm not saying they're not the best we've got, but it's a big problem for machine translation that the only way to evaluate results is, essentially, to compare them to eyeballing.

paulsutter 3298 days ago

Google Translate is now based on a neural network, and you can be sure they have solid metrics. By analogy, Google Search has a large panel of humans whose subjective feedback is used to test the quality of search algorithm variations.

YeGoblynQueenne 3298 days ago

This is something that needs to be repeated until everyone internalises it: for language pairs other than the "easy" ones Google translate sucks.

I am Greek and translations from and to my language are utterly ridiculous, on the level of Bozo the clown doing the translation with his underpants on his head back to front.

Typical example: I put in the Greek word for "swallow", the bird, and ask for the French translation. I get back the word "avaler" - the French word for "to swallow", the verb.

That's my little benchmark there, useful because Google translate has been doing this consistently, for a good few years, before it used neural networks, before it started claiming its setup essentially constitutes an "interlingua" etc etc.

Note that the bird and the verb sound nothing like each other in Greek, or French. They sound the same only in English, so GT goes from Greek to French through English, because it doesn't have enough parallel texts to go directly to French. And so it sucks, because it doesn't have enough data. You can ask native speakers of other languages that are not English or that have fewish speakers, perhaps Turkish or Hungarian etc. I'm pretty sure you'll find out they have similar experiences.
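A toy illustration of why routing through English would produce exactly this error. The dictionaries below are made up for the example and say nothing about how Google Translate actually works; the only real facts used are that "χελιδόνι" is the bird ("hirondelle" in French) and "καταπίνω" is the verb ("avaler").

```python
# Hypothetical toy dictionaries: both Greek words collapse onto the same
# English string, and the English-to-French step keeps only one sense.
GREEK_TO_ENGLISH = {"χελιδόνι": "swallow", "καταπίνω": "swallow"}
ENGLISH_TO_FRENCH = {"swallow": "avaler"}

# What a direct Greek-to-French dictionary would contain.
GREEK_TO_FRENCH = {"χελιδόνι": "hirondelle", "καταπίνω": "avaler"}


def translate_via_pivot(word):
    """Greek -> English -> French: the sense is lost at the English step."""
    return ENGLISH_TO_FRENCH[GREEK_TO_ENGLISH[word]]


def translate_direct(word):
    """Greek -> French without a pivot language."""
    return GREEK_TO_FRENCH[word]


pivot_result = translate_via_pivot("χελιδόνι")
direct_result = translate_direct("χελιδόνι")
```

Here the pivot route turns the bird into the verb, while the direct route keeps the bird, which matches the behaviour described above.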

So I don't know what metric they use to evaluate their results, but it doesn't seem to be a particularly good metric of translation quality. Maybe they just care more about how many people use their system and try to optimise for that, rather than for quality, which is much harder to measure.

ajuc 3297 days ago

I'm Polish. Even for Slavic languages that are very close to Polish (Ukrainian, Slovak; they're something like 50% understandable without translation), I use Google Translate into English rather than into Polish, because Google's X -> Polish translation sucks.

YeGoblynQueenne 3298 days ago

>> you can be sure they have solid metrics.

Btw, no, I can't be sure of that. Why do you say I can? Do you know what metrics they use?

empath75 3298 days ago

But deep learning networks are being used in production every day, aren't they?

putnam 3298 days ago

Yes

naturalgradient 3298 days ago

Somewhat off-topic, but this has also sparked a rather frank debate on r/MachineLearning about some of the things discussed in the review, in particular arXiv flag planting:

https://www.reddit.com/r/MachineLearning/comments/6gke6a/d_r...