If you feed in a score ("{-1,0,1} on final positions"), it's effectively a label, and that makes the training supervised rather than unsupervised. See [1] for good reasons to be skeptical of unsupervised learning in general.

See [2] for a twist on the DeepMind Atari player. They use Monte Carlo Tree Search (MCTS, of automated Go-playing fame) to generate training data. By feeding that more carefully generated gameplay data into the deep Q-learning net, they exceed DeepMind's (non-MCTS-coupled) performance.
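To make the "planner generates the training data" idea concrete, here is a toy sketch, not the setup from [2]. Everything in it is an assumption for illustration: the environment is a made-up 5-state chain, the planner is flat Monte Carlo rollouts standing in for full MCTS, and the "network" is a majority-vote lookup table standing in for a deep net. The names `step`, `plan`, `generate_dataset`, and `fit_policy` are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Toy chain world (an assumption, not Atari): states 0..GOAL, reward on reaching GOAL.
N = 5
GOAL = N - 1
ACTIONS = (-1, +1)

def step(state, action):
    """Move left or right, clamped to the chain; reaching GOAL ends the episode."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def rollout_value(state, first_action, rng, horizon=20, gamma=0.9):
    """Discounted return of taking first_action, then acting uniformly at random."""
    s, r, done = step(state, first_action)
    total, discount = r, gamma
    for _ in range(horizon - 1):
        if done:
            break
        s, r, done = step(s, rng.choice(ACTIONS))
        total += discount * r
        discount *= gamma
    return total

def plan(state, rng, n_rollouts=100):
    """Flat Monte Carlo planner (stand-in for MCTS): pick the action with the best mean return."""
    means = {
        a: sum(rollout_value(state, a, rng) for _ in range(n_rollouts)) / n_rollouts
        for a in ACTIONS
    }
    return max(means, key=means.get)

def generate_dataset(rng, episodes=20):
    """Let the slow planner play, recording (state, planner_action) pairs as labels."""
    data = []
    for _ in range(episodes):
        s, done = rng.randrange(GOAL), False
        while not done:
            a = plan(s, rng)
            data.append((s, a))
            s, _, done = step(s, a)
    return data

def fit_policy(data):
    """Stand-in for training a net: majority vote over the planner's actions per state."""
    votes = defaultdict(Counter)
    for s, a in data:
        votes[s][a] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in votes.items()}

rng = random.Random(0)
dataset = generate_dataset(rng)
policy = fit_policy(dataset)
print(sorted(policy.items()))
```

The point of the split is that the planner is expensive at play time (many rollouts per move), while the fitted policy is a cheap lookup that imitates it. The planner's choices act as labels, so the second stage is plain supervised learning.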

1. http://karpathy.github.io/2014/07/03/feature-learning-escapa...

2. http://www-personal.umich.edu/~rickl/pubs/guo-singh-lee-lewi...