by joshuamorton
2963 days ago
Oh cool, I've also done a presentation on bandit methods for early undergrads (see: https://gtagency.github.io/2016/experimentation-with-no-ragr..., it's missing speaker notes so it looks a bit strange, but it outlines the structure fairly well). It was also sort of an intro to the beta distribution and why you should love it, hence the focus on Thompson Sampling.

Some critiques:

- I feel like your justification/explanation for why this is useful is a bit lacking. Personally I find framing it in terms of regret-minimization better than gain-maximization, even though in practice they're the same. It frames the situation so that you go in knowing you will sometimes have to pick non-optimal things, so your job is to learn the underlying distributions as quickly as possible, instead of trying to pick the best things. Interestingly, I think thinking of it as gain-maximization leads you down an epsilon-greedy path, whereas regret-minimization leads you toward UCB1/Thompson Sampling. Since you pivot to RL instead of just bandits, I can kind of understand it, but see my last point.

- As a general rule, I try to minimize math in undergrad-focused talks/documents. Even as someone who spends a lot of time explaining statistical concepts to people, my eyes glaze over when I see `q*(a) = E[Rt|At = a]`. Obviously you need some, and this is just a personal thing. For the most part I actually think you do a decent job of explaining the equations you use. At least until the gradient bandit part :P Then it just feels like a textbook proof excerpt.

- Nit: You don't fully explain that epsilon-greedy is greedy, except epsilon of the time. That tripped me up for a second.

- The last thing is that I feel like the motivation for, and the difference between, stationary and nonstationary reward distributions isn't well explained. Nonstationary rewards don't really "fit" the mental model behind k-armed bandits a lot of the time.
I'm actually curious for a better motivation there, as I can't articulate one myself.
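
For anyone who wants to poke at the regret framing and the epsilon-greedy nit from the critiques above, here is a rough sketch, not code from the article or from my slides. It assumes Bernoulli arms with fixed (stationary) success probabilities, a uniform Beta(1, 1) prior for Thompson Sampling, and epsilon = 0.1; the names `epsilon_greedy`, `thompson` and `run` are made up for illustration. Regret here is the expected reward given up on each pull compared to always playing the best arm.

```python
import random


def epsilon_greedy(values, epsilon, rng):
    """Greedy, except epsilon of the time: explore a uniformly random arm
    with probability epsilon, otherwise play the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])


def thompson(successes, failures, rng):
    """Draw one sample from each arm's Beta posterior (Beta(1, 1) prior,
    an assumption) and play the arm with the largest draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])


def run(policy, probs, steps, seed=0, epsilon=0.1):
    """Simulate `steps` pulls and return the cumulative expected regret."""
    rng = random.Random(seed)
    successes = [0] * len(probs)
    failures = [0] * len(probs)
    best = max(probs)
    regret = 0.0
    for _ in range(steps):
        if policy == "epsilon_greedy":
            # Estimated value of an arm = observed success rate (0 if unplayed).
            values = [s / (s + f) if s + f else 0.0
                      for s, f in zip(successes, failures)]
            arm = epsilon_greedy(values, epsilon, rng)
        else:
            arm = thompson(successes, failures, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        # Regret: what we gave up in expectation by not playing the best arm.
        regret += best - probs[arm]
    return regret


if __name__ == "__main__":
    arms = [0.2, 0.5, 0.7]  # hypothetical success probabilities
    for policy in ("epsilon_greedy", "thompson"):
        print(policy, run(policy, arms, steps=5000))
```

With a fixed epsilon, epsilon-greedy keeps paying for random exploration forever, while Thompson Sampling explores less as the posteriors tighten. Treat the printed numbers as one seeded run, not a benchmark.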