| HN Mirror



	by highd 3270 days ago
	Yes - the issue is that the work is currently presented as requiring "no training", but it has simply relocated that problem to constructing a perfect simulation of the environment. It then uses the fact that current benchmark environments ship with a simulator to "cheat" rather than learning that function itself. One of the most difficult and interesting parts of reinforcement learning is constructing the function that determines the evolution of the system. If you know the evolution function a priori, the problem is mostly trivial - e.g. alpha-beta search, graph search, etc. It's interesting that this merit function works in the absence of a real reward signal, but there's no fair comparison against systems that use a reward signal, because providing a perfect simulation is such a huge alteration to the problem.
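To illustrate why a known evolution function makes the problem mostly a search problem, here is a minimal sketch, assuming a deterministic toy grid. The grid, `WALLS`, `MOVES`, `grid_step`, and `plan` are hypothetical names invented for this example and are not from the work being discussed. Plain breadth-first search finds a shortest action sequence with no learning at all, because it can query the transition function freely.

```python
from collections import deque

# Hypothetical 3x3 grid world with one blocked cell (an assumption for illustration).
WALLS = {(1, 1)}
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def grid_step(state, action):
    """Known, deterministic evolution function: blocked moves leave the state unchanged."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < 3 and 0 <= nxt[1] < 3
    return nxt if inside and nxt not in WALLS else state

def plan(start, is_goal, actions, step):
    """Breadth-first search; returns a shortest list of actions, or None if unreachable."""
    parents = {start: None}  # state -> (previous state, action taken), None for the start
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if is_goal(state):
            path = []
            while parents[state] is not None:
                state, action = parents[state]
                path.append(action)
            return path[::-1]
        for action in actions:
            nxt = step(state, action)
            if nxt not in parents:
                parents[nxt] = (state, action)
                queue.append(nxt)
    return None

path = plan((0, 0), lambda s: s == (2, 2), tuple(MOVES), grid_step)
```

The planner never sees a reward and never trains; all the work is done by having `grid_step` available, which is exactly the part a model-free RL agent would not be given.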


gabrielgoh 3270 days ago

I agree completely; what's happening is nothing more than brute-force search. Though I do think this is still interesting, as the reward here is potentially much better conditioned than the rewards in RL.

Having said that, there are situations where this will fail completely, e.g. maze solving, where the goal is not to play to keep playing but to play to reach the end.
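The maze failure mode can be sketched on a toy example. This is a simplified stand-in, not the method from the work under discussion: "flexibility" here is just the number of distinct states reachable within a short horizon, found by brute-force lookahead through a perfect simulator. The corridor, `GOAL`, `reachable`, and `flexible_action` are all hypothetical names chosen for illustration. Because reaching the goal ends the game, the goal looks like the least flexible state and the planner steers away from it.

```python
GOAL = 4  # right end of a 5-cell corridor (cells 0..4); reaching it ends the game

def actions(state):
    """No moves are available once the game has ended."""
    return [] if state == GOAL else [-1, +1]

def step(state, action):
    """Perfect simulator: move left or right, clamped to the corridor."""
    return min(max(state + action, 0), GOAL)

def reachable(state, horizon):
    """Brute-force set of states reachable from `state` within `horizon` moves."""
    seen = {state}
    frontier = {state}
    for _ in range(horizon):
        frontier = {step(s, a) for s in frontier for a in actions(s)} - seen
        seen |= frontier
    return seen

def flexible_action(state, horizon=3):
    """Pick the move whose successor keeps the most distinct futures open."""
    return max(actions(state),
               key=lambda a: len(reachable(step(state, a), horizon)))

# Start one cell away from the goal and follow the heuristic for ten moves.
state, trajectory = 3, [3]
for _ in range(10):
    state = step(state, flexible_action(state))
    trajectory.append(state)
```

In this toy setup the agent starts next to the exit and still walks away from it, since stepping onto the goal collapses its reachable set to a single state.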


highd 3270 days ago

It seems like a more comparable reinforcement learning approach would be to combine the entropy criterion with a known reward, when one is available, and then do Q-learning on that combination without the simulation requirement. Then in cases where the reward is uncertain or infrequent, you fall back to a flexibility heuristic.
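One way this suggestion could look in code is sketched below, under several assumptions. The entropy criterion is replaced by a simple count-based novelty bonus, which is a swapped-in stand-in and not the criterion from the work being discussed. The corridor environment, `env_step`, `train`, and the hyperparameters are hypothetical. The agent only experiences transitions one step at a time and never plans through a simulator. The behaviour policy is uniformly random for simplicity, which is acceptable here because Q-learning is off-policy.

```python
import random
from collections import defaultdict

GOAL = 4            # 5-cell corridor; the only extrinsic reward is at the right end
ACTIONS = (-1, +1)

def env_step(state, action):
    """Environment the agent can only sample from: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train(episodes=300, alpha=0.5, gamma=0.9, beta=0.1, seed=0):
    """Tabular Q-learning on extrinsic reward plus a decaying novelty bonus."""
    rng = random.Random(seed)
    q = defaultdict(float)     # (state, action) -> value estimate
    visits = defaultdict(int)  # state -> number of times it was entered
    for _ in range(episodes):
        state = 0
        for _ in range(100):   # cap on episode length
            action = rng.choice(ACTIONS)
            nxt, reward, done = env_step(state, action)
            visits[nxt] += 1
            # Assumed stand-in for the flexibility heuristic: it matters most
            # while the real reward is absent and fades as states become familiar.
            bonus = beta / visits[nxt] ** 0.5
            target = reward + bonus
            if not done:
                target += gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q

q = train()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

Unlike a pure "keep playing" objective, the learned greedy policy here heads for the exit, because the known reward dominates the bonus once it has been observed.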


robertsdionne 3270 days ago

Maybe like https://pathak22.github.io/noreward-rl/
