by j7ake
872 days ago
Why are you using a Markov process to model time-dependent likelihood pathways, though? It doesn't make sense. Your next step depends on much more than just knowing that you are at S; one needs to account for the history of where you were before. Or maybe you're just using technical words with precise meanings to describe a vague, imprecise heuristic?
|
Future reward trajectories are THE core focus of multi-step MDPs; see Sutton [1]:
"Now we consider transitions from state-action pair to state-action pair, and learn the value of state-action pairs. Formally these cases are identical: they are both Markov chains with a reward process. The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values: "
I wasn't going to differentiate in my original post between sub-types of "cycles" within increasingly complex MDPs for long-sequence reward estimation.
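For what it's worth, the action-value version of TD(0) that the quote describes is usually called Sarsa. Here is a minimal tabular sketch on a toy chain. The environment, the function name `sarsa_chain`, and all hyperparameters are assumptions made up for illustration; none of them come from the linked book page.

```python
import random


def sarsa_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                epsilon=0.1, max_steps=10_000, seed=0):
    """Tabular Sarsa on a hypothetical chain 0..n_states-1.

    Actions: 0 = left, 1 = right. Entering the last state ends the
    episode with reward 1; every other transition gives reward 0.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]

    def choose(s):
        # Epsilon-greedy, with random tie-breaking so the all-zero
        # initial table does not lock the agent into one action.
        if rng.random() < epsilon or q[s][0] == q[s][1]:
            return rng.randrange(2)
        return 0 if q[s][0] > q[s][1] else 1

    for _ in range(episodes):
        s = 0
        a = choose(s)
        for _ in range(max_steps):  # safety cap on episode length
            s2 = max(0, s - 1) if a == 0 else s + 1
            if s2 == goal:
                # Terminal transition: the value after the goal is 0.
                q[s][a] += alpha * (1.0 - q[s][a])
                break
            a2 = choose(s2)
            # The update goes from (s, a) to (s2, a2): a chain over
            # state-action pairs with a reward of 0 on this step.
            q[s][a] += alpha * (gamma * q[s2][a2] - q[s][a])
            s, a = s2, a2
    return q


if __name__ == "__main__":
    for state, (left, right) in enumerate(sarsa_chain()):
        print(state, round(left, 3), round(right, 3))
```

Note that the update only ever looks at the current pair (s, a) and the next pair (s2, a2). That is the Markov assumption in action: if the history matters, it has to be folded into the state.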
[1] http://incompleteideas.net/book/ebook/node64.html