Hacker News new | ask | show | jobs
by fxtentacle 1661 days ago
My personal experience was the opposite. I'm currently trying different approaches for building a Bomberman AI for the Bomberland competition that was discussed here on HN a few weeks ago.

"IMPALA with 1 learner takes only around 10 hours to reach the same performance that A3C approaches after 7.5 days." says the paper, but I can run A3C on a cheap CPU-only server but to get that IMPALA timing, I need to spend a lot of money. But my biggest roadblock so far is that I need compute far exceeding what the papers claim.

The diagrams for IMPALA show good performance starting at 1e8 environment frames and excellent performance at 1e9 frames. By now, I'm at 2.5e9 frames and performance is still bad. In my opinion, the reason is that the sequence lengths for Bomberland are quite long. To clear a path, you place a bomb, wait 5 ticks for it to become detonatable, then detonate it, then wait 10 ticks for the fire to clear. With 7 possible actions per tick, the chance of randomly executing this 17 tick sequence becomes (1/7)^17 = 4e-15. If I calculate optimistically that all moves are valid, too, while we wait, then I can get up to (1/7)(5/7)^5(1/7)*(5/7)^10 = 1e-4. But that still means that at 1e8 env steps, I only have 1000 successful executions to learn from.

3 comments

I don't have a lot of experience with IMPALA, but the sequence of events you describe should be very easy for an end-to-end system. Assuming you don't have an end to end system, just getting a gradient would result in rapid learning of that sequence. I'm surprised that at 2.5e9 frames you're not done. Perhaps there is a hyperparameter issue. Sorry I can't help but it sounds like you are in the same place I am with ML project. Good luck.
Not an expert, but I believe many papers on other video games make a single decision for the next X frames at once, possibly including a delay factor that governs exactly when to act. I think OpenAI’s Dota2 agent does this.
I have experimented with that, too, but in my case it also multiplies the number of potential actions. If I have 7 actions per timestep, grouping them into 3-timestep blocks means I now have 777 = 343 possibilities to choose from.

From what I understand, the OpenAI Dota 2 AI has a long-term strategy module which was mostly trained by imitating 60,000+ replays played by human professional teams. My problem with doing that for the Borderland competition is that I don't have any data source for replays of someone playing the game really well. You control 3 units simultaneously and it's 2 teams against each other, so I'd need 6 dedicated volunteers playing the game for many hours to create a reasonably-sized corpus of human replays. And who says that those people are good at it?

Hm not an expert in this, but would something with a world model help, rather than depending on stochastic random action choices? It seems like it should be possible to learn that a frame sequence where you've been next to a bomb for 6 ticks is rapidly decreasing your expected score, and that your score would be significantly better if you weren't in line with the bomb pretty soon.
I'm in the process of attempting just that, with limited success. In my case, I trained a classifier that takes the current surroundings of the player unit and tries to predict that we'll gain an advantage in this segment of the game. I split the game into segments based on when the HP relationships between teams change. And gaining an advantage then means that you take more HP from the enemy team than what you and your teammates lost.

The classifier has on average 90% accuracy which seems good. I then use the likelihood predicted by this classifier to compute the weight with which I want to train each action and if I want to train it positively (by pulling its likelihood of being chosen up) or negatively (pushing the likelihood of that action down).

However, what this model cannot correctly represent is the fact that whether or not a given situation will turn out to be good or bad in the long term is highly dependent on how you play. So if I train this with replay data, I will score the situations in relation to how well those (outdated) AIs could take advantage of them.

Next up, I'll try to fix this issue by introducing a graph-like stochastic structure. The basic idea is that I encode "from this state S if I take action A, then I can reach state T with P percent likelihood" into yet another neural network. If I then identify a state which is really beneficial in the sense that I can reliably convert it into an advantage, then I can use this graph to back-propagate that knowledge so that I get "from this state S, action A takes me to state T, then action B takes me to state U, and U is great".

That should allow me to train with historical data to identify which transitions are possible, and then I can combine that with realtime data about the desirability of each state. So basically I'd do A* pathfinding over the graph of possible states to identify which actions are needed to bring me from my current situation into the closest "I will surely win" situation. Except that the graph is memorized by an AI because the real state-space is huge: 15x15 fields with 6 units + 5 environment states => roughly 11^(15*15) states

Ah yeah the hp thing sounds a bit like what the OpenAI dota team did with their team scoring differential thing.

I’ve not built what you’re describing, just read related research papers, so I can’t really evaluate your plan, but I can wish you good luck!