by AndrewKemendo
873 days ago
> time-dependent likelihood pathways

Future reward trajectories are THE core focus of multi-step MDPs, see Sutton [1]:

"Now we consider transitions from state-action pair to state-action pair, and learn the value of state-action pairs. Formally these cases are identical: they are both Markov chains with a reward process. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values:"

I wasn't going to differentiate in my original post between sub-types of "cycles" within increasingly complex MDPs for long sequence reward estimation:

[1] http://incompleteideas.net/book/ebook/node64.html
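
For anyone who doesn't want to click through: the passage quoted from Sutton leads into the action-value form of the TD(0) update, which the book calls Sarsa. A minimal sketch of that single update step is below. The function name `sarsa_update`, the dict-based table, and the values of `alpha` and `gamma` are my own assumptions for illustration, not anything from the book.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one on-policy TD(0) update to a state-action value table.

    q is a mapping from (state, action) pairs to estimated values.
    alpha (step size) and gamma (discount) are illustrative defaults.
    """
    # Bootstrapped target: immediate reward plus the discounted value
    # of the next state-action pair actually taken.
    td_target = r + gamma * q[(s_next, a_next)]
    # Move the current estimate a fraction alpha toward the target.
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Hypothetical two-state example: all values start at 0.0.
q = defaultdict(float)
sarsa_update(q, "s0", "right", 1.0, "s1", "right")
```

The only difference from the state-value case is that the table is indexed by (state, action) pairs instead of states.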
|
Markov processes are nice because they are simple objects and therefore have nice properties and solid mathematical proofs.
Many mathematical models are studied because they have nice theoretical properties and one can prove theorems about them. This should not be mistaken for an actual mechanistic explanation of complex emergent phenomena like human decisions.