Hacker News
by bubblyworld 473 days ago
What an awesome project! I'm curious: I would have thought that rewarding unique coordinates would be enough to get the agent to (eventually) explore all areas, including the key ones. What did the agents end up doing before key areas got an extra reward?

(and how on earth did you port Pokémon red to a RL environment? O.o)
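For context, a unique-coordinate reward of the kind the question describes might look like the minimal sketch below. It assumes the wrapper can read a map ID and the player's (x, y) tile each step and that visited tiles are tracked per episode; the class name and the `scale` parameter are hypothetical, not taken from the project.

```python
class CoordinateNoveltyReward:
    """Hypothetical count-based exploration reward: pays `scale` the first
    time a (map_id, x, y) tile is visited in an episode, 0 afterwards."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale
        self.seen: set[tuple[int, int, int]] = set()

    def reset(self) -> None:
        # Start each episode with an empty visited set.
        self.seen.clear()

    def __call__(self, map_id: int, x: int, y: int) -> float:
        tile = (map_id, x, y)
        if tile in self.seen:
            return 0.0
        self.seen.add(tile)
        return self.scale


reward_fn = CoordinateNoveltyReward()
print(reward_fn(1, 5, 5))  # 1.0 (new tile)
print(reward_fn(1, 5, 5))  # 0.0 (already visited)
```

Because every new tile is worth the same here, nothing in this reward by itself favors a key area over any other unexplored corner of the map.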


The environments wouldn't concentrate enough in the Rocket Hideout beneath the Celadon Game Corner. Instead, the agent would have the player wander the world, reward hacking. With wild battles enabled, the environments would end up in Lavender Tower fighting Gastly.
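One way to give key areas "an extra reward", as the question puts it, is to weight new tiles in designated maps more heavily than ordinary new tiles. This is a minimal sketch assuming a per-episode tile-novelty reward keyed on (map ID, x, y); the map IDs 900 and 901 and the 5.0 weight are made-up placeholders, not the project's actual maps or values.

```python
# Placeholder map IDs and weights: NOT real Pokémon Red values.
KEY_MAP_WEIGHTS = {
    900: 5.0,  # hypothetical ID standing in for a Rocket Hideout floor
    901: 5.0,  # hypothetical ID standing in for another key floor
}


class WeightedNoveltyReward:
    """Tile-novelty reward where new tiles in key maps pay a larger weight
    than new tiles elsewhere; revisits pay nothing."""

    def __init__(self, key_map_weights: dict, default_weight: float = 1.0) -> None:
        self.key_map_weights = dict(key_map_weights)
        self.default_weight = default_weight
        self.seen: set = set()

    def reset(self) -> None:
        # Visited tiles are tracked per episode.
        self.seen.clear()

    def __call__(self, map_id: int, x: int, y: int) -> float:
        tile = (map_id, x, y)
        if tile in self.seen:
            return 0.0
        self.seen.add(tile)
        return self.key_map_weights.get(map_id, self.default_weight)


reward_fn = WeightedNoveltyReward(KEY_MAP_WEIGHTS)
print(reward_fn(0, 3, 4))    # 1.0, ordinary new tile
print(reward_fn(900, 3, 4))  # 5.0, new tile in a key map
```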

> (and how on earth did you port Pokémon red to a RL environment? O.o)

Read and find out :)

Thanks haha, I kept reading =D. I see, so it's not just that you have to visit the key areas; they need to show up in the episodes often enough to provide a training signal.
Yup!
You don't port it, you wrap it. You can put anything in an RL environment. Emulators are usually done with BizHawk and some Lua. Worst case, there's FFI or screen capture.
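To illustrate "wrap, don't port", here is a minimal sketch of an environment wrapper that follows the Gymnasium reset/step return conventions without depending on that library. The `Emulator` interface, button list, frame count, and RAM addresses are all hypothetical placeholders; a real binding (BizHawk plus a Lua script, an FFI binding, or screen capture) would supply the actual methods. For brevity the observation is a RAM-derived position tuple rather than screen pixels, and a toy `FakeEmulator` stands in so the loop can run.

```python
from typing import Protocol


class Emulator(Protocol):
    """Hypothetical emulator binding (e.g. a Lua script behind a pipe, an FFI
    wrapper, or screen capture plus synthetic input)."""

    def load_state(self, name: str) -> None: ...
    def press(self, button: str, frames: int) -> None: ...
    def read_ram(self, address: int) -> int: ...


BUTTONS = ["up", "down", "left", "right", "a", "b", "start"]
# Placeholder RAM addresses, NOT real Pokémon Red offsets.
ADDR_MAP, ADDR_X, ADDR_Y = 0x01, 0x02, 0x03


class EmulatorEnv:
    """Wraps an unmodified game: actions become button presses, observations
    and rewards come from reading emulator memory. Returns follow the
    Gymnasium convention: reset -> (obs, info), step -> 5-tuple."""

    def __init__(self, emu: Emulator, start_state: str,
                 frames_per_action: int = 24, max_steps: int = 1000) -> None:
        self.emu = emu
        self.start_state = start_state
        self.frames_per_action = frames_per_action
        self.max_steps = max_steps
        self.steps = 0
        self.seen: set = set()

    def _observe(self) -> tuple:
        return (self.emu.read_ram(ADDR_MAP),
                self.emu.read_ram(ADDR_X),
                self.emu.read_ram(ADDR_Y))

    def reset(self):
        self.emu.load_state(self.start_state)  # restore a saved starting point
        self.steps = 0
        obs = self._observe()
        self.seen = {obs}
        return obs, {}

    def step(self, action: int):
        self.emu.press(BUTTONS[action], self.frames_per_action)
        self.steps += 1
        obs = self._observe()
        reward = 0.0 if obs in self.seen else 1.0  # tile-novelty reward
        self.seen.add(obs)
        truncated = self.steps >= self.max_steps  # time limit, not a game ending
        return obs, reward, False, truncated, {}


class FakeEmulator:
    """Toy stand-in that moves a point on a grid, so the wrapper can run."""

    MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self) -> None:
        self.ram = {ADDR_MAP: 0, ADDR_X: 0, ADDR_Y: 0}

    def load_state(self, name: str) -> None:
        self.ram = {ADDR_MAP: 0, ADDR_X: 0, ADDR_Y: 0}

    def press(self, button: str, frames: int) -> None:
        dx, dy = self.MOVES.get(button, (0, 0))
        self.ram[ADDR_X] += dx
        self.ram[ADDR_Y] += dy

    def read_ram(self, address: int) -> int:
        return self.ram[address]


env = EmulatorEnv(FakeEmulator(), start_state="start.state", max_steps=3)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(BUTTONS.index("right"))
print(obs, reward)  # (0, 1, 0) 1.0
```

The game itself stays untouched; all RL-specific logic (action mapping, reward, episode limits) lives in the wrapper, which is why swapping the binding (Lua pipe, FFI, screen capture) doesn't change the training code.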
Right, my thought was that this would be way too slow for episode rollouts (versus an accelerated implementation in JAX or something), but I guess not!
Well, that's the golden issue with RL: sample efficiency. Training is environment-bound, so you want an architecture that extracts the maximum possible information from each collected sample, avoiding catastrophic forgetting and prioritizing samples according to relevance.
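The comment doesn't name a specific method. One well-known way to prioritize samples by relevance is proportional prioritized experience replay, which applies to off-policy methods that keep a replay buffer: each transition gets priority p_i = (|TD error| + eps)^alpha and is sampled with probability p_i / sum_k p_k. The sketch below is a swapped-in illustration of that technique under those assumptions, not what the commenter or the project used; it omits importance-sampling corrections and uses linear-time sampling where real implementations usually use a sum-tree.

```python
import random


class ProportionalReplay:
    """Minimal proportional prioritized replay buffer.

    Priority p_i = (|td_error_i| + eps) ** alpha; transition i is sampled
    with probability p_i / sum_k p_k. Sampling here is O(n) per call.
    """

    def __init__(self, alpha: float = 0.6, eps: float = 1e-3) -> None:
        self.alpha = alpha
        self.eps = eps
        self.items: list = []
        self.priorities: list[float] = []

    def _priority(self, td_error: float) -> float:
        return (abs(td_error) + self.eps) ** self.alpha

    def add(self, transition, td_error: float) -> None:
        self.items.append(transition)
        self.priorities.append(self._priority(td_error))

    def probabilities(self) -> list[float]:
        total = sum(self.priorities)
        return [p / total for p in self.priorities]

    def sample(self, k: int, rng: random.Random) -> list[int]:
        # Draw k indices with replacement, weighted by priority.
        return rng.choices(range(len(self.items)), weights=self.priorities, k=k)

    def update(self, index: int, td_error: float) -> None:
        # Refresh a priority after the learner recomputes its TD error.
        self.priorities[index] = self._priority(td_error)


buf = ProportionalReplay(alpha=1.0, eps=0.0)
buf.add("t0", td_error=1.0)
buf.add("t1", td_error=3.0)
print(buf.probabilities())  # [0.25, 0.75]
```

Setting alpha to 0 recovers uniform sampling, so alpha controls how strongly "relevance" (here, TD error) skews which samples get replayed.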
My first version of this project, 5 years ago, actually used a Python-Lua named pipe with BizHawk. No clue where that code went.