

by nopinsight 2428 days ago

As someone working in the field, I congratulate the authors on an excellent accomplishment, but I agree with them that we shouldn't get too excited yet (their quote follows the four reasons below). Here are some reasons:

1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers

2) T5 was trained with ~750GB of text or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.

3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful but human-level understanding is more than correlations.

4) The performance on datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, p.32 https://arxiv.org/pdf/1910.10683.pdf

"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."

I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.

---

We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:

* Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).

* Commonsense tests with longer chains of inference than those needed to solve Winograd Schemas, set in non-standard situations (e.g. a fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.

* Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.

3 comments

YeGoblynQueenne 2428 days ago

I think that the point about the majority of tests being multiple-choice is the most important one to underline.

Structuring a problem as a multiple-choice task basically turns it into a classification problem, but it doesn't answer the question everyone wants answered: is it really possible to reduce the problem of language understanding to classification? That is, is it really possible to understand human language with no other ability than the ability to identify the classes of objects?

But that is a question that has to be answered before any performance on benchmarks that reduce language understanding to classification can be appraised correctly. If accurate classification is not sufficient for language understanding, then beating benchmarks like SuperGLUE tells us nothing new (we already know we have good classifiers).
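
To make the reduction concrete, here is a minimal Python sketch of how a multiple-choice item becomes a classification problem: every candidate answer gets a score, and the index of the best score is the predicted class. The `overlap_score` function is a purely hypothetical stand-in for a model (it only counts shared words) and is not how T5 or any SuperGLUE system works.

```python
from typing import Callable, Sequence


def overlap_score(premise: str, choice: str) -> float:
    """Hypothetical scorer: fraction of the choice's words that appear in the premise."""
    premise_words = set(premise.lower().split())
    choice_words = choice.lower().split()
    if not choice_words:
        return 0.0
    return sum(w in premise_words for w in choice_words) / len(choice_words)


def classify(
    premise: str,
    choices: Sequence[str],
    score: Callable[[str, str], float] = overlap_score,
) -> int:
    """Treat each choice as a class label and return the index of the best-scoring one."""
    scores = [score(premise, c) for c in choices]
    # max() returns the first index on ties, so the result is deterministic.
    return max(range(len(choices)), key=scores.__getitem__)


premise = "the glass fell off the table"
choices = ["the glass broke on the floor", "the sun rose"]
predicted = classify(premise, choices)
```

Even this toy scorer picks a plausible-looking answer purely from surface correlation, which is the worry: a correct class label by itself says little about whether anything was understood.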

The problem here is that we have no good measures of language understanding, of humans or machines, because we have a poor, er, understanding of our own language ability. Until we know more about what it means to understand language it won't be possible to evaluate automated language understanding systems very well.

Hopefully, though, the skepticism I've observed around results like the one above will lead to a renewed effort to research our language ability, and perhaps our intelligence in general.


VikingCoder 2428 days ago

> 2) T5 was trained with ~750GB of text or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.

...but, humans evolved the ability to use language over hundreds of generations... So... Maybe that's not such a bad thing?


msamwald 2428 days ago

Indeed, this is important to realize: training such a generic model from scratch recapitulates not only learning but also the entire evolutionary process that led to the emergence of neural circuits actually capable of such learning. That perspective makes many of the current achievements, error-prone as they might be, even more impressive!


nopinsight 2428 days ago

The amount of data required may not be a decisive factor but rather a canary in the coal mine that something is off.

If we wish to use a model in critical situations, such as a medical setting or commanding a self-driving car, 1) and 4) above cannot be ignored.


wongarsu 2428 days ago

> 1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here

Humans are susceptible to adversarial triggers too, so this doesn't necessarily make the model less impressive. It is a big problem in practical use though.


nopinsight 2428 days ago

I am curious about what you mean by adversarial examples/triggers for humans in the domain of natural language.

Off the top of my head, I can think of:

* garden path sentences

* highly recursive sentences

Could you or anyone provide some other classes?

The two classes above, however, can generally be understood by a large number of educated native speakers with time to think carefully.

Humans also do not get derailed as badly as in the examples at this link: http://www.ericswallace.com/triggers


wongarsu 2428 days ago

I don't think universal triggers exist, since at that point they are just language features. But there are plenty of less universal triggers.

Let's imagine that in the brain everything goes through a series of models: first tokenization into words, then we build something like an abstract syntax tree, then we analyse meaning in context, etc. Each time one of these steps reaches a nonsensical result, we start over with additional parsing time allocated. It's probably not true, but close enough to be a useful model.

Now what you consider an adversarial example depends on how far down the stack it has to go until it's caught:

- "The old man the boat." fails in the early parsing steps. We reliably miscategorize "old" as an adjective when it's a noun.

- "More people have been to Russia than I have, said Escher" goes a step further: it parses just fine but makes no sense. The tricky thing is that you might initially not notice that it makes no sense. This is about the level where AI is today.

- "Time flies like an arrow; fruit flies like a banana" makes perfect sense, but you could notice that the straightforward way to parse it leads to a non-sequitur and parsing it as "time-flies love eating arrows; fruit-flies love eating bananas" is probably a better way to parse it.
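
As a rough illustration of the staged model above, here is a minimal Python sketch: stages run in order, and when one reports nonsense the whole pipeline starts over with a larger budget. Everything in it is a toy assumption made up for illustration (the `Nonsense` exception, the budget doubling, the two stages, and the rule that "man" is only considered as a verb once the budget reaches 2); it is not a claim about how brains or real parsers work.

```python
from typing import Callable, List, Tuple


class Nonsense(Exception):
    """Raised by a stage that cannot make sense of its input."""


Stage = Tuple[str, Callable[[object, int], object]]


def run_pipeline(text: str, stages: List[Stage], budget: int = 1, max_budget: int = 8):
    """Run stages in order; on failure, start over with a doubled budget.

    Returns (result, log), where log holds one (budget, failed_stage) pair per restart.
    """
    log = []
    while budget <= max_budget:
        value = text
        current = ""
        try:
            for name, stage in stages:
                current = name
                value = stage(value, budget)
            return value, log
        except Nonsense:
            log.append((budget, current))
            budget *= 2
    raise Nonsense(f"gave up after {len(log)} attempts")


def tokenize(text: str, budget: int) -> List[str]:
    """Split into lowercase words, dropping a trailing period."""
    return text.rstrip(".").lower().split()


def parse(tokens: List[str], budget: int) -> List[str]:
    """Toy parser: fail unless some token is an accepted verb."""
    # With a low budget "man" is never tried as a verb, so the garden path fails.
    verbs = {"sails"} | ({"man"} if budget >= 2 else set())
    if not any(t in verbs for t in tokens):
        raise Nonsense("no verb found")
    return tokens


STAGES: List[Stage] = [("tokenize", tokenize), ("parse", parse)]

garden_path, garden_log = run_pipeline("The old man the boat.", STAGES)
plain, plain_log = run_pipeline("The old man sails the boat.", STAGES)
```

In this sketch `garden_log` records one restart caused by the `parse` stage, while `plain_log` stays empty. How far down the list of stages the failure is recorded corresponds to how far down the stack an adversarial example gets before it is caught.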

Of course, that's just the parsing steps. You can trick human "sentiment analysis" by swapping words without changing the meaning. Compare "this bag is made from fake leather" to "this bag is made from vegan leather". PR and marketing have made a science out of how to make bad things sound good. Similarly, PR is great at finding adversarial examples for reading comprehension, where they say one thing that's nearly universally understood to mean something different (or to mean nothing at all, or where something that seems to mean nothing at all actually means something very significant).

Of course, we assume all text to be targeted at humans, so if something is widely misunderstood by humans we blame the sender for writing such a bad message; when it's widely misunderstood by AI we blame the AI for being so bad at reading.
