
	by fxtentacle 1655 days ago
	This is a great result, but you can see that it's more of a theoretical case because of this: "converging to perfect play as available computation time and approximation capacity increases." That is true for pretty much all current deep reinforcement learning algorithms. The practical question is: How much computation do you need to get useful results? Alpha Go Zero is impressive mathematics, but who is willing to spend $1mio daily for months to train it? IMPALA (another Google one) can learn almost all Atari games, but you need a head node with 256 TPU cores and 1000+ evaluation workers to replicate the timings from the paper.

2 comments

sillysaurusx 1655 days ago

You often don't need anywhere near the amount of compute in these papers to get similar performance.

Suppose you're a business that needs to play games. Most people seem to think that it's a matter of plugging in the settings from the paper, buying the same hardware, then clicking a button and waiting.

It's not. The specific settings matter a lot.

But my main point is that you'll get most of your performance pretty rapidly. The only reason to leave it running for so long is to get that last N%, which is nice for benchmarks but not for business.

DeepMind overspends. Actually, they don't; they're not paying anywhere close to the price of a 256 core TPU. (Many external companies aren't, either, and you can get a good deal by negotiating with the Cloud TPU team.)

But you don't need a 256 core TPU. Lots of times, these algorithms simply do not require the amount of compute that people throw at the problem.

On the other hand, you can also usually get access to that kind of compute. A 256 core TPU isn't beyond reach. I'm pretty sure I could create one right now. It's free, thanks to TFRC, and you yourself can apply (and be approved). I was. https://sites.research.google/trc/

It kills me that it's so hard to replicate these papers, which is most of the motivation for my comment here. Ultimately, you're right: "How much compute?" is a big unknown. But the lower bound is much lower than most people (and most researchers) realize.

fxtentacle 1655 days ago

My personal experience was the opposite. I'm currently trying different approaches for building a Bomberman AI for the Bomberland competition that was discussed here on HN a few weeks ago.

"IMPALA with 1 learner takes only around 10 hours to reach the same performance that A3C approaches after 7.5 days," says the paper. But I can run A3C on a cheap CPU-only server, whereas to get that IMPALA timing I need to spend a lot of money. And my biggest roadblock so far is that I need compute far exceeding what the papers claim.

The diagrams for IMPALA show good performance starting at 1e8 environment frames and excellent performance at 1e9 frames. By now, I'm at 2.5e9 frames and performance is still bad. In my opinion, the reason is that the sequence lengths for Bomberland are quite long. To clear a path, you place a bomb, wait 5 ticks for it to become detonatable, then detonate it, then wait 10 ticks for the fire to clear. With 7 possible actions per tick, the chance of randomly executing this 17 tick sequence becomes (1/7)^17 = 4e-15. If I calculate optimistically that all moves are valid, too, while we wait, then I can get up to (1/7) * (5/7)^5 * (1/7) * (5/7)^10 = 1e-4. But that still means that at 1e8 env steps, I only have 1000 successful executions to learn from.
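
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two estimates. The names `N_ACTIONS`, `WAIT_OK`, and `expected_successes` are made up for illustration, and splitting the frames into non-overlapping 17-tick windows is an assumption about how the "1000 successful executions" figure was reached, not something the comment states.

```python
# Sketch of the random-exploration odds described above (illustrative names).
N_ACTIONS = 7   # possible actions per tick
WAIT_OK = 5     # assumed number of harmless actions while waiting
SEQ_LEN = 17    # place bomb + 5 wait ticks + detonate + 10 wait ticks

# Strict case: every one of the 17 ticks must hit one exact action.
p_strict = (1 / N_ACTIONS) ** SEQ_LEN

# Optimistic case: only "place" and "detonate" must be exact,
# any of the WAIT_OK harmless actions is fine on the waiting ticks.
p_optimistic = (
    (1 / N_ACTIONS) * (WAIT_OK / N_ACTIONS) ** 5
    * (1 / N_ACTIONS) * (WAIT_OK / N_ACTIONS) ** 10
)


def expected_successes(env_steps: int, p: float, seq_len: int = SEQ_LEN) -> float:
    """Expected successes if env_steps is cut into non-overlapping windows."""
    return (env_steps // seq_len) * p


print(f"strict:     {p_strict:.1e}")
print(f"optimistic: {p_optimistic:.1e}")
print(f"successes at 1e8 frames: {expected_successes(10**8, p_optimistic):.0f}")
```

Under these assumptions the optimistic estimate lands in the high hundreds of successful sequences per 1e8 frames, the same order of magnitude as the figure quoted above.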

Javantea_ 1654 days ago

I don't have a lot of experience with IMPALA, but the sequence of events you describe should be very easy for an end-to-end system. Assuming you don't have an end-to-end system, just getting a gradient would result in rapid learning of that sequence. I'm surprised that at 2.5e9 frames you're not done. Perhaps there is a hyperparameter issue. Sorry I can't help, but it sounds like you are in the same place I am with my ML project. Good luck.

iwd 1655 days ago

Not an expert, but I believe many papers on other video games make a single decision for the next X frames at once, possibly including a delay factor that governs exactly when to act. I think OpenAI’s Dota2 agent does this.

fxtentacle 1655 days ago

I have experimented with that, too, but in my case it also multiplies the number of potential actions. If I have 7 actions per timestep, grouping them into 3-timestep blocks means I now have 7 * 7 * 7 = 343 possibilities to choose from.
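
The blow-up is easy to reproduce. The snippet below is only a sketch: the integer action labels and the helper name `macro_actions` are placeholders, not anything from the Bomberland API.

```python
from itertools import product

ACTIONS = range(7)  # 7 primitive actions per tick (placeholder labels)


def macro_actions(block_len: int) -> list:
    """Enumerate every sequence of block_len primitive actions."""
    return list(product(ACTIONS, repeat=block_len))


for k in (1, 2, 3):
    print(k, len(macro_actions(k)))  # grows as 7**k
```

Each extra timestep in a block multiplies the number of choices by 7, so the policy head grows exponentially with the block length.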

From what I understand, the OpenAI Dota 2 AI has a long-term strategy module which was mostly trained by imitating 60,000+ replays played by human professional teams. My problem with doing that for the Bomberland competition is that I don't have any data source for replays of someone playing the game really well. You control 3 units simultaneously and it's 2 teams against each other, so I'd need 6 dedicated volunteers playing the game for many hours to create a reasonably-sized corpus of human replays. And who says that those people are good at it?

ericd 1654 days ago

Hm not an expert in this, but would something with a world model help, rather than depending on stochastic random action choices? It seems like it should be possible to learn that a frame sequence where you've been next to a bomb for 6 ticks is rapidly decreasing your expected score, and that your score would be significantly better if you weren't in line with the bomb pretty soon.

fxtentacle 1654 days ago

I'm in the process of attempting just that, with limited success. In my case, I trained a classifier that takes the current surroundings of the player unit and tries to predict whether we'll gain an advantage in this segment of the game. I split the game into segments based on when the HP relationships between teams change. Gaining an advantage then means that you take more HP from the enemy team than you and your teammates lost.

The classifier has on average 90% accuracy which seems good. I then use the likelihood predicted by this classifier to compute the weight with which I want to train each action and if I want to train it positively (by pulling its likelihood of being chosen up) or negatively (pushing the likelihood of that action down).

However, what this model cannot correctly represent is the fact that whether or not a given situation will turn out to be good or bad in the long term is highly dependent on how you play. So if I train this with replay data, I will score the situations in relation to how well those (outdated) AIs could take advantage of them.

Next up, I'll try to fix this issue by introducing a graph-like stochastic structure. The basic idea is that I encode "from this state S if I take action A, then I can reach state T with P percent likelihood" into yet another neural network. If I then identify a state which is really beneficial in the sense that I can reliably convert it into an advantage, then I can use this graph to back-propagate that knowledge so that I get "from this state S, action A takes me to state T, then action B takes me to state U, and U is great".

That should allow me to train with historical data to identify which transitions are possible, and then I can combine that with realtime data about the desirability of each state. So basically I'd do A* pathfinding over the graph of possible states to identify which actions are needed to bring me from my current situation into the closest "I will surely win" situation. Except that the graph is memorized by an AI because the real state-space is huge: 15x15 fields with 6 units + 5 environment states => roughly 11^(15*15) states
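
To make the idea concrete, here is a toy sketch with a hand-written transition table standing in for the learned network. Everything in it is hypothetical: the state names S, T, U, the action names, and the probabilities. It also swaps A* for plain Dijkstra over -log(probability), since no heuristic is defined here, and it only finds the single most likely path rather than doing full planning over all stochastic outcomes.

```python
import heapq
import math

# Hypothetical model: transitions[state][action] = [(next_state, prob), ...]
transitions = {
    "S": {"A": [("T", 0.9), ("S", 0.1)], "C": [("U", 0.2), ("S", 0.8)]},
    "T": {"B": [("U", 0.8), ("S", 0.2)]},
    "U": {},  # the "I will surely win" state in this toy example
}


def most_likely_plan(start, goal):
    """Dijkstra over -log(prob); returns (path probability, list of actions)."""
    frontier = [(0.0, start, [])]
    best = {}
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return math.exp(-cost), plan
        if state in best and best[state] <= cost:
            continue  # already expanded via a more likely path
        best[state] = cost
        for action, outcomes in transitions[state].items():
            for nxt, p in outcomes:
                if p > 0:
                    heapq.heappush(
                        frontier, (cost - math.log(p), nxt, plan + [action])
                    )
    return 0.0, None


prob, plan = most_likely_plan("S", "U")
print(plan, round(prob, 2))
```

In this toy table, going S to T to U via actions A and B (0.9 * 0.8 = 0.72) beats the direct but unreliable action C (0.2), which is the kind of "U is great, so work backwards" reasoning described above.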

ericd 1654 days ago

Ah yeah the hp thing sounds a bit like what the OpenAI dota team did with their team scoring differential thing.

I’ve not built what you’re describing, just read related research papers, so I can’t really evaluate your plan, but I can wish you good luck!

loxias 1655 days ago

My thoughts, not being in the field, are parallel to the parent post. "It's nice and all that we're achieving better and better computer performance at things that used to require the human brain, but it seems we're doing so by building larger and larger computers." Not to detract from that achievement, I love large computers in their own right!

I'm a dabbler in Go, and "somewhere below professional" at the game of poker. I've followed the advances in the latter for more than a decade, eagerly reading every paper the CPRG publishes. They use a LOT of compute power!

I know from experience that "The specific settings matter a lot.". For several years, I made my living "implementing papers for hire". It's real work, no argument there. Sometimes the settings are the solution, and heck, sometimes the published algorithm is outright wrong, and you only discover so when trying to implement it.

But the second part of your point, that it's not simply achieving more performance by throwing more transistors at it, I don't have experience with, and I sorta don't believe you. :)

Your comment is quite well written, making me (irrationally?) predisposed to suspect you're correct on factual matters, or at least more of a domain expert than I. Can you cite sources, or simply elaborate more?

fault1 1655 days ago

> "The specific settings matter a lot.".

Yes, and in the case of deep RL, the ability to get a "lucky" random initialization seems to (still) matter a lot.

I work in real time control systems, which are roughly decision making under uncertainty problems. A lot of the RL research has become noise buoyed with large marketing budgets.

gwern 1654 days ago

> That is true for pretty much all current deep reinforcement learning algorithms.

Is that true? I was unaware that PPO, SAC, DQN, Impala, MuZero/AlphaZero etc would all automatically Just Work™ for hidden information games. Straight MCTS-inspired algorithms seem like they'd fail for reasons discussed in the paper, and while PPO/Impala work reasonably well in DoTA2/SC2, it's not obvious they'd converge to perfect play.

fxtentacle 1654 days ago

You can mathematically prove for a lot of different algorithms (including PPO, DQN, IMPALA) that given enough experience with the game world, they will eventually converge to the optimal policy. It's just that the "enough experience" part might be so large that it's practically useless.

If I remember correctly, the DeepMind x UCL RL Lecture Series proves the underlying Bellman equation in this video: https://www.youtube.com/watch?v=zSOMeug_i_M

As for "hidden information" games, I thought the trick was to concatenate the current state with all past states and treat that as the new state, thereby making it an MDP.
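
A minimal sketch of that trick in Python. The class name `HistoryStack` and the fixed window size `k` are assumptions for illustration; the comment talks about all past states, whereas this truncates to the last `k` observations.

```python
from collections import deque


class HistoryStack:
    """Keeps the last k observations and exposes them as one combined state."""

    def __init__(self, k: int, pad=None):
        # Pre-fill with a padding value so the state always has length k.
        self.frames = deque([pad] * k, maxlen=k)

    def push(self, obs):
        """Add a new observation and return the stacked state as a tuple."""
        self.frames.append(obs)
        return tuple(self.frames)


stack = HistoryStack(k=3)
for obs in ["a", "b", "c", "d"]:
    print(stack.push(obs))
```

The agent is then trained on the stacked tuple instead of the latest observation alone.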

gwern 1653 days ago

I don't think you can prove that (forgive me if I don't sit through a 2h video). Those all are susceptible to the deadly triad, and AFAIK there are no convergence proofs of any kind for the big model-free DL algs, and it would've been big news if someone had proved that a real-world version of PPO/DQN/IMPALA does in fact converge in the limit. Sutton's book and earlier proofs only cover cases where you drop the nonlinear approximator or something.

(History stacking may turn POMDPs into MDPs, but I don't know if they handle the specially adversarial nature of games like poker. That's quite different from stacking ALE frames.)

cygaril 1652 days ago

Standard RL algorithms will converge to optimal play versus a fixed opponent, but will not find an optimal policy via self play.

One intuitive way to see this is that a sequence of improving pure policies A < B < C < etc. will converge to optimal play in a perfect information game like chess, but not necessarily in an imperfect information game like rock/paper/scissors where Rock < Paper < Scissors < Rock, etc
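
The cycle can be shown in a few lines of Python. This is a sketch with made-up helper names (`payoff`, `best_response`), using pure-strategy best responses as a stand-in for "each new policy beats the previous one".

```python
# What each move beats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(a: str, b: str) -> int:
    """+1 if a beats b, -1 if a loses to b, 0 on a tie."""
    if a == b:
        return 0
    return 1 if BEATS[a] == b else -1


def best_response(opponent: str) -> str:
    """Pure strategy with the highest payoff against a fixed pure opponent."""
    return max(BEATS, key=lambda a: payoff(a, opponent))


# Repeatedly "improving" against the previous policy never settles.
policy, seen = "rock", []
for _ in range(6):
    seen.append(policy)
    policy = best_response(policy)
print(seen)

# Against a uniform random opponent, every pure action has the same total payoff.
print({a: sum(payoff(a, b) for b in BEATS) for a in BEATS})
```

The sequence of pure policies loops forever, while the uniform mixed strategy leaves the opponent with nothing to exploit.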
