Hacker News
by aledalgrande 1857 days ago
Content of the article:

- 1000 times more powerful than BERT, but still transformer architecture

- trained on 75+ languages, can transfer knowledge between languages

- can do text and images (not audio and video yet)

- can understand context, go deeper in a topic and generate content

Not much apart from their words about how amazing it is. Paper? Demo?

3 comments

Lol, they state that their model is 1000 times more powerful than BERT? Under what metric?
According to my understanding they are referring to parameter count. If we go by that logic, BERT has 340M parameters. GPT3 has 175B. So this will have 340B parameters?
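The arithmetic behind that guess can be checked quickly. A minimal sketch, assuming "1000 times more powerful" means 1000 times the parameter count (an assumption; nothing in the announcement confirms it) and using the parameter counts quoted above:

```python
# Parameter counts as quoted in the comment above.
BERT_LARGE_PARAMS = 340_000_000      # 340M
GPT3_PARAMS = 175_000_000_000        # 175B

# Assumption: "1000x more powerful" is read as 1000x the parameters.
SCALE_FACTOR = 1000

implied_params = BERT_LARGE_PARAMS * SCALE_FACTOR
ratio_to_gpt3 = implied_params / GPT3_PARAMS

print(f"implied size: {implied_params / 1e9:.0f}B parameters")
print(f"relative to GPT-3: {ratio_to_gpt3:.2f}x")
```

Under that reading the model would be a bit under twice the size of GPT-3.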
That's what I was wondering! Such gibberish
Well, so far they're mostly talking about what it would be able to do, so it's probably more wishful thinking than any exact metric.
> trained on 75+ languages, can transfer knowledge between languages

There is zero possibility that Google accomplished proper "language transfer" with the vast majority of Silicon Valley programmers being native English speakers.

In some languages, if you accidentally use a single wrong syllable in a sentence, you can end up saying something extremely embarrassing and entirely different. This is the case with many Slavic languages.

This is a memorable "classic" [1]:

> "Tony Henry belted out a version of the Croat[ian] [national] anthem before the 80,000 crowd, but made a blunder at the end. He should have sung 'Mila kuda si planina' (which roughly means 'You know my dear how we love your mountains'). But he instead sang 'Mila kura si planina' which can be interpreted as 'My dear, my penis is a mountain'."

Many languages are much more grammatically complex than English, and also have an unbelievable amount of implicit contextual information derived from the grammatical morphology. For example, Slavic languages tend to be this way. The Slavic language that I speak, Croatian, tends to be very clean, direct, and concise, while being extremely complicated grammatically. Also, we have a lot of different words for the same thing in Croatian, which, in combination with the complicated grammar, makes it a very expressive language. English, however, can be more expressive, in the sense that it allows for more figurative language, such as idioms.

[1] BBC: Anthem gaffe 'lifted Croatia': http://news.bbc.co.uk/sport2/hi/football/7109058.stm

Modern NLP architectures do not explicitly model language structure. Even in English, the model isn't directly told anything about how words work. So the native language of the human authors of the model is (in principle) irrelevant to how effective the system is.
> There is zero possibility that Google accomplished proper "language transfer" with the vast majority of Silicon Valley programmers being native English speakers.

This speaks to ignorance of who Google employs. A ton of the engineers there are immigrants. When I was on Google Photos in MTV, I'd estimate it was about evenly split between native, English-first speakers and people who were either non-native English speakers or grew up with two languages simultaneously (children of first-gen immigrants in the US).

Silicon Valley has a huge amount of cultural and ethnic diversity, so I don't know why you would make this mistake.

> There is zero possibility that Google accomplished proper "language transfer" with the vast majority of Silicon Valley programmers being native English speakers.

I don't know the people who worked on this project, but you do realise that Google employs swaths of programmers who are not native English speakers?

There is nothing here but a promise. Back in the day we called this "vaporware".
I don't think it's vaporware, but the blog post with all these big claims like 1000 times more powerful than BERT (based on our arbitrary cherry-picked metric) makes one cringe.

Here's my guess: Some team under web search trained a large Transformer-based model, with some adjustments here and there, on a massive dataset of crawled web pages using tons of TPUs. It made an incremental improvement to the search quality metrics and was shipped to production.

We sort of already know that these models scale in such a way that a model with 1000 times the parameters is, indeed, 1000 times more powerful. We haven't found a ceiling effect yet, so the onus is on the skeptics. These things scale.
According to the scaling laws it scales on a log scale.
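To see how far "1000 times the parameters" is from "1000 times more powerful" under a power law, here is a rough sketch. It assumes loss follows L(N) = (N_c / N)^alpha in parameter count N, with alpha around 0.076. That exponent is an assumption borrowed from published language-model scaling-law results, not something stated in this thread or the announcement, and `loss_ratio` is a hypothetical helper name.

```python
# Assumed power-law exponent for loss vs. parameter count (illustrative only).
ALPHA_N = 0.076

def loss_ratio(param_multiplier: float, alpha: float = ALPHA_N) -> float:
    """Loss of the larger model divided by loss of the smaller one,
    assuming L(N) is proportional to N ** (-alpha)."""
    return param_multiplier ** (-alpha)

ratio = loss_ratio(1000)
print(f"loss with 1000x parameters: {ratio:.2f} of the original")
```

With these assumed numbers, 1000 times the parameters cuts the loss by roughly 40 percent, a long way from a 1000-fold gain. Each further factor of 10 in size buys the same constant multiplicative reduction, which is the sense in which it "scales on a log scale".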
It's Schrödinger's vaporware. We'll find out some years from now. In Perl 6's case, what, 12 years after the announcement?
Except this is Google, not some startup.
Vaporware also happens with established companies.
Like IBM's Watson after Jeopardy.
What vaporware has come out of Google Brain? In fact, they've been publishing ground-breaking research after ground-breaking research that's completely changed the entire field in recent years.
After seeing Alpha* solve Go, Chess, and protein folding in the past ~3 years, I think it would be pretty silly for your prior to be discounting any Google AI project as vaporware.

Their models accomplish ridiculously powerful things. Tbh I think it's far _more_ likely the answer is "this is crazy powerful, but the engineers didn't feel like writing a blog post about it, and the marketing team hasn't figured out how to monetize it yet".

If there's anything SoTA AI researchers love and have experience doing, it's writing blog posts and papers explaining how their models work.

The lack of details makes me think they're either hiding a new technique they'd rather keep secret because it provides a competitive advantage, or that it's really only a marginal improvement over existing NLP models (or an ensemble of them with nearly no improvement on any given metric) and the 1000x improvement is on a metric that no actual ML scientist would respect.

I don't have the slightest bit of information about Google's AI team to know if those are the only two options and if so which is more likely.

It's not a secret at all. Transformer models scale. Big models are powerful. Everyone knows this. Google can afford to train very big models. It's not a new technique. I think the issue here is that people are uncomfortable with the idea of AI models displaying scale relativity.
A big model also means lots of data, including lots of unfiltered garbage used in training. Nobody can manually review that much data; at this scale all they can do is automated filtering. So the model has a large attack surface, and sooner or later it will be made to do something bad and embarrass itself once critics determined to find those gaps get hold of it.

In the last few months we have seen attacks on Google Translate, GPT-3 and other language models from the PC crowd, including the famous AI Ethics firings. It's just tricky to show it in this climate.

The PC crowd don't believe language is fair and concepts neutral, instead saying they are an expression of systems of power. So language models are a natural target for them because they could amplify biases against their identity groups.

I find this critique hasty, especially because big language models are a nascent technology. We shouldn't throw the baby out with the bathwater!

The PC crowd is right. Language encodes our cultural beliefs, and many of them are pretty rotten. But how do you update a culture's shared set of beliefs? Banning words is a symbolic exercise. What we tend to do instead is that we tell stories and share perspectives. We learn to empathize.

Figuring out how to feed language models with diverse sources of information is a tough challenge, but not impossible. I share Gebru's concern about "stochastic parrots".

I think showing the model would immediately trigger the critics to nitpick it like the famous "He is a doctor. She is a nurse." case, so they just don't show it until they figure out a way to avoid that. Moreover, language models are easy to trick into politically incorrect conversations and porn. AI Dungeon's GPT-3 was writing lots of porn, for example.