by highd
3270 days ago
Yes - the issue is that the work is presented as requiring "no training", but it has simply relocated that problem to constructing a perfect simulation of the environment. It then uses the fact that current benchmark systems come with simulators to "cheat" rather than learning that function itself. One of the most difficult and interesting parts of reinforcement learning is constructing the function that determines how the system evolves. If you know the evolution function a priori, the problem is mostly trivial - e.g. alpha-beta search, graph search, etc. It's interesting that this merit function works in the absence of a real reward signal, but there's no fair comparison against systems that use a reward signal, because being handed a perfect simulation alters the problem so much.

Having said that, there are situations where this will fail completely, e.g. maze solving, where the goal is not to play in order to keep playing but to play in order to reach the end.
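
To make the "mostly trivial" point concrete, here is a minimal sketch, not anything from the work being discussed. It assumes a hypothetical toy grid maze (`MAZE`) and a hypothetical `step` function standing in for the perfect simulator. Given that evolution function and a goal test, plain breadth-first search finds the exit with no training at all.

```python
from collections import deque

# Hypothetical toy maze (an assumption for illustration):
# '#' is a wall, 'S' the start, 'G' the goal.
MAZE = [
    "S.#.",
    "..#.",
    ".#..",
    "...G",
]

# Up, down, left, right as (row delta, column delta).
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def find(ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(MAZE):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not in maze")


def step(state, action):
    """Stand-in for the perfect simulator: the known evolution function.

    Returns the next state, or the same state if the move is blocked.
    """
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#":
        return (nr, nc)
    return state


def solve(start, goal):
    """Breadth-first search over states, using only step() and a goal test."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for action in ACTIONS:
            nxt = step(state, action)
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None  # goal unreachable


path = solve(find("S"), find("G"))
print(path)
```

Note that the search only terminates usefully because it is given a goal test, which is exactly the information a pure "keep playing" objective lacks.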