| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by andy_xor_andrew 409 days ago
	the magic thing about off-policy techniques such as Q-Learning is that they will converge on an optimal result even if they only ever see sub-optimal training data. For example, you can use a dataset of chess games from agents that move totally randomly (with no strategy at all) and use that as an input for Q-Learning, and it will still converge on an optimal policy (albeit more slowly than if you had more high-quality inputs)

1 comments

Ericson2314 409 days ago

I would think this being true is the definition of the task being "ergodic" (distorting that term slightly, maybe). But I would also expect non-ergodic tasks to exist.

link