by gabrielgoh 3270 days ago
I think an analogy can be made with Bayesian statistics. In principle, Bayesian statistics requires no training, just a way of sampling from the posterior, usually done with expensive MCMC methods.

Here, we do not need training of any kind either, just a Monte Carlo simulation of the environment and an approximation of which path has the greatest path entropy. Basically, given a state, you do:

- Compute the path entropy for all states you can move to

- Move into the state with greatest path entropy
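
A minimal sketch of those two steps, under stated assumptions: the simulator is a hypothetical `step(state, action)` function, rollouts pick actions uniformly at random, and path entropy is approximated by the Shannon entropy of the empirical distribution of sampled paths. The `corridor_step` toy environment is invented for illustration and is not from the work being discussed.

```python
import math
import random
from collections import Counter

def rollout(step, state, actions, horizon, rng):
    """Follow uniformly random actions for `horizon` steps; return the visited path."""
    path = [state]
    for _ in range(horizon):
        state = step(state, rng.choice(actions))
        path.append(state)
    return tuple(path)

def path_entropy(step, state, actions, horizon=8, samples=200, rng=None):
    """Monte Carlo estimate: Shannon entropy of the sampled path distribution."""
    rng = rng or random.Random(0)
    counts = Counter(rollout(step, state, actions, horizon, rng) for _ in range(samples))
    return -sum((c / samples) * math.log(c / samples) for c in counts.values())

def choose_move(step, state, actions, **kw):
    """Pick the action whose successor state has the greatest estimated path entropy."""
    return max(actions, key=lambda a: path_entropy(step, step(state, a), actions, **kw))

def corridor_step(state, action):
    """Toy simulator (assumption): positions 0..10, where 0 is an absorbing 'game over'."""
    if state == 0:
        return 0
    return max(0, min(10, state + action))

# Standing next to the absorbing cell, the agent steps away from it.
print(choose_move(corridor_step, 1, [-1, 1]))
```

Every call to `choose_move` runs `samples` rollouts per candidate action, which is where the inference-time cost comes from.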

The tradeoff here is that all the work occurs at inference: every decision requires a complex simulation. In training-based approaches the heavy lifting is done during training, and inference is easy.


Yes - the issue is that the work is currently presented as requiring "no training", but it has simply relocated that problem to constructing a perfect simulation of the environment. It then uses the fact that current benchmarking systems have simulations available to "cheat" rather than learning that function itself. One of the most difficult and interesting parts of reinforcement learning is constructing the function that determines the evolution of the system. If you know the evolution function a priori, the problem is mostly trivial - e.g. alpha-beta search, graph searching, etc.

It's interesting that this merit function works in the absence of a real reward signal, but there's no fair comparison against systems using a reward signal, because providing a perfect simulation alters the problem so drastically.

I agree completely that what's happening is nothing more than brute-force search. I do think it is still interesting, though, as the reward here is potentially much better conditioned than the rewards in RL.

Having said that, there are situations where this will fail completely, e.g. maze solving, where the goal is not to play to keep playing but to play to reach the end.

It seems like a more comparable reinforcement learning approach would be to combine the entropy criterion in some way with a known reward when one is available, and then do Q-learning on that without the simulation requirement. Then, in cases where reward is uncertain or infrequent, you fall back to a flexibility heuristic.
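
One way this could look, as a rough sketch rather than a tested method: a tabular Q-learning update whose target adds a weighted flexibility bonus to the environment reward. The `flexibility` callable, the weight `beta`, and the additive combination are all assumptions made for illustration; the comment above does not say how the two signals should be combined.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, flexibility,
             alpha=0.1, gamma=0.95, beta=0.5):
    """One tabular Q-learning step on a reward shaped by a flexibility bonus.

    `flexibility(state)` is a hypothetical heuristic (for example, a learned or
    hand-written estimate of how many futures remain open from `state`).
    """
    shaped = r + beta * flexibility(s_next)          # assumption: additive mix
    best_next = max(Q[(s_next, b)] for b in actions)  # greedy bootstrap value
    Q[(s, a)] += alpha * (shaped + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# With no environment reward (r = 0), only the flexibility term drives learning.
Q = defaultdict(float)
print(q_update(Q, s=0, a=1, r=0.0, s_next=1, actions=[0, 1],
               flexibility=lambda state: 2.0))
```

When `r` is zero or rarely observed, the update is driven entirely by the flexibility term, which matches the fallback behavior described above; no simulator is needed because each update uses only an observed transition.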