by gwern
2896 days ago
> There's nothing about PPO that helps it learn long-range strategies.

Exactly. Which is why it's so surprising that it did anyway, despite that and discount rates which don't give any value past a minute or so.

> DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/

Note that the CTF agent is way more complex, featuring multilevel RL and evolutionary losses, and even DNC in the agents.
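
To make the discount-rate point concrete, here is a rough sketch of how quickly an exponentially discounted reward loses its weight. The helper names are hypothetical, and the values `gamma = 0.998` and 7.5 decisions per second are illustrative assumptions, not figures taken from this thread.

```python
import math


def discount_half_life_seconds(gamma: float, steps_per_second: float) -> float:
    """Seconds until a future reward's weight gamma**t drops to one half."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    # Solve gamma**t = 0.5 for t (in agent steps), then convert to seconds.
    steps = math.log(0.5) / math.log(gamma)
    return steps / steps_per_second


def discounted_weight(gamma: float, steps_per_second: float, seconds: float) -> float:
    """Weight given to a reward that arrives `seconds` in the future."""
    return gamma ** (steps_per_second * seconds)


if __name__ == "__main__":
    # Assumed, illustrative settings (not from the comment above).
    gamma = 0.998
    steps_per_second = 7.5

    half_life = discount_half_life_seconds(gamma, steps_per_second)
    print(f"half-life: {half_life:.1f} s")
    for seconds in (60, 300):
        weight = discounted_weight(gamma, steps_per_second, seconds)
        print(f"weight of a reward {seconds} s ahead: {weight:.3f}")
```

Under those assumed settings the half-life comes out a bit under a minute, and a reward five minutes away is weighted at roughly 1%, which is the sense in which such a discount rate gives almost no value to anything far in the future.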