Hacker News
by oneraynyday 2968 days ago
Yes, it is as you stated. Because bandits are stateless, there is no state parameter as in $q_*(a,s)$. From where I learned it, using $q_*$ in the same context is arguably an abuse of notation. My newer entry (currently a WIP) uses $q_*(a,s)$ with the discounted cumulative sum of future rewards. Thanks for the replies, guys :)
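For reference, here is a rough sketch of the two definitions being contrasted. It assumes the usual textbook conventions; the symbols $R_t$, $A_t$, $S_t$, $G_t$, $\pi$, and the discount $\gamma \in [0,1]$ are not from the comment above and are assumptions here. The argument order $(a,s)$ follows the comment.

Stateless bandit case, where the value depends only on the action:

$$q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

Stateful case, using the discounted cumulative sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad q_*(a,s) = \max_{\pi} \, \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\ A_t = a \right]$$

Under these assumptions, the bandit version is the special case with a single state and only the immediate reward, which is why reusing $q_*$ for both can read as an abuse of notation.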