| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gdb 2917 days ago

The best way we know to think of it is in terms of variance of the gradient.

In a hard environment, your gradients will be very noisy — but effectively no more than linear in the duration you are optimizing over, provided that you have a reasonable solution for exploration. As you scale your batch size, you can decrease your variance linearly. So you can use good ol' gradient descent if you can scale up linearly in the hardness of the problem.

This is a handwavy argument admittedly, but seems to match what we are seeing in practice.

Simulators are nice because it is possible to take lots of samples from them — but there's a limit to how many samples can be taken from the real world. In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go. As a community, we are just starting to get fast enough computers to test out ideas there.

4 comments

shawn 2917 days ago

Yo, this probably isn't the type of HN comment you're used to, but I just wanted to say thanks for enriching the dota community. I know that's not really why you're doing any of this, but as someone who's deeply involved with the community, people get super hyped about what you guys have been doing.

They also understand all of the nuances, similar to HN. Last year when you guys beat Arteezy, everyone grokked that 5v5 was a completely different and immensely difficult problem in comparison. There's a lot of talent floating around /r/dota2, amidst all the memes and silliness. And for whatever reason, the community loves programming stories, so people really listen and pay attention.

https://imgur.com/Lh29WuC

So yeah, we're all rooting for you. Regardless of how it turns out this year, it's one of the coolest things to happen to the dota 2 scene period! Many of us grew up with the game, so it's wild to see our little mod suddenly be a decisive factor in the battle for worldwide AI dominance.

Also 1v1 me scrub

link

oblio 2917 days ago

Agreed! Can't wait to not have to play Dota 2 with humans :p

link

Ntrails 2916 days ago

> Also 1v1 me scrub

I wanted to play SF against the bot so badly - even knowing I'd get absolutely destroyed over and over agin

link

ilyasut 2916 days ago

EDIT (I work at OpenAI and wrote the statement about the variance of the gradient being linear): Here's a more precise statement: the variance is exponential in the "difficulty" of the exploration problem. The harder the exploration, the worse is the gradient. So while it is correct that things become easy if you assume that exploration is easy, the more correct way of interpreting our result is that the combination of self play and our shaped reward made the gradient variance manageable at the scale of the compute that we've use.

link

defen 2916 days ago

> In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go.

Will those models be introspectible / transferrable? One thing I'm curious about is how AI's learn about novel actions / scenarios which are "fatal" in the real world? Humans generally spend a lot of time being taught these things (rather than finding out for themselves obviously) and eventually come up with a fairly good set of rules about how not to die in stupid ways.

link

wassyape 2915 days ago

Transferability depends on the way the models is set up, and moves on a scale.

Introspectable: given that you can ask unlimited "What if" questions models, we should be able to get a lot of insights into how the models work internally. And you can often design them to be introspectable as some performance or complexity cost. (if that's what you meant by introspectable).

link

obastani 2917 days ago

Can you clarify why variance only scales linearly in the duration you are optimizing over? I would have expected it to be exponential, since the size of the space you are searching is exponential in the duration.

link

ilyasut 2917 days ago

Re variance, the argument is not entirely bullet proof, but it goes like this: we know that the variance of the gradient of ES grows linearly with the dimensionality of the action space. Therefore, the variance of the policy gradient (before backprop through the neural net) should similarly be linear in the dimensionality of the combined action space, which is linear in the time horizon. And since backprop through a well-scaled neural net doesn't change the gradient norm too much, the absolute gradient variance of the policy gradient should be linear in time horizon also.

This argument is likely accurate in the case where exploration is adequately addressed (for example, with a well chosen reward function, self play, or some kind of an exploration bonus). However, if exploration is truly hard, then it may be possible for the variance of the gradient to be huge relative to the norm of the gradient (which would be exponentially small), even though the absolute variance of the gradient is still linear in the time horizon.

link

ilyasut 2916 days ago

Correction: see https://news.ycombinator.com/item?id=17394916

link

obastani 2916 days ago

That makes sense, thanks for clarifying!

link