| You often don't need anywhere near the amount of compute in these papers to get similar performance. Suppose you're a business that needs to play games. Most people seem to think that it's a matter of plugging in the settings from the paper, buying the same hardware, then clicking a button and waiting. It's not. The specific settings matter a lot. But my main point is that you'll get most of your performance pretty rapidly. The only reason to leave it running for so long is to get that last N%, which is nice for benchmarks but not for business. DeepMind overspends. Actually, they don't; they're not paying anywhere close to the price of a 256 core TPU. (Many external companies aren't, either, and you can get a good deal by negotiating with the Cloud TPU team.) But you don't need a 256 core TPU. Lots of times, these algorithms simply do not require the amount of compute that people throw at the problem. On the other hand, you can also usually get access to that kind of compute. A 256 core TPU isn't beyond reach. I'm pretty sure I could create one right now. It's free, thanks to TFRC, and you yourself can apply (and be approved). I was. https://sites.research.google/trc/ It kills me that it's so hard to replicate these papers, which is most of the motivation for my comment here. Ultimately, you're right: "How much compute?" is a big unknown. But the lower bound is much lower than most people realize (and most researchers). |
"IMPALA with 1 learner takes only around 10 hours to reach the same performance that A3C approaches after 7.5 days." says the paper, but I can run A3C on a cheap CPU-only server but to get that IMPALA timing, I need to spend a lot of money. But my biggest roadblock so far is that I need compute far exceeding what the papers claim.
The diagrams for IMPALA show good performance starting at 1e8 environment frames and excellent performance at 1e9 frames. By now, I'm at 2.5e9 frames and performance is still bad. In my opinion, the reason is that the sequence lengths for Bomberland are quite long. To clear a path, you place a bomb, wait 5 ticks for it to become detonatable, then detonate it, then wait 10 ticks for the fire to clear. With 7 possible actions per tick, the chance of randomly executing this 17 tick sequence becomes (1/7)^17 = 4e-15. If I calculate optimistically that all moves are valid, too, while we wait, then I can get up to (1/7)(5/7)^5(1/7)*(5/7)^10 = 1e-4. But that still means that at 1e8 env steps, I only have 1000 successful executions to learn from.