| HN Mirror


by joshuamorton 2963 days ago

Oh cool, I've also done a presentation on bandit methods for early undergrads (see: https://gtagency.github.io/2016/experimentation-with-no-ragr...; it's missing speaker notes, so it looks a bit strange, but it outlines the structure fairly well). It was also sort of an intro to the beta distribution and why you should love it, hence the focus on Thompson Sampling.

Some critiques:

- I feel like your justification/explanation for why this is useful is a bit lacking. Personally I find framing it in terms of regret-minimization better than gain-maximization, even though in practice they're the same. I think it frames the situation so that you go in knowing you will sometimes have to pick non-optimal things, so your job is to learn the underlying distributions as quickly as possible, instead of trying to pick the best things. Interestingly, I think thinking of it as gain-maximization leads you down an epsilon-greedy path, whereas regret-minimization leads you more naturally toward UCB1/Thompson Sampling. Since you pivot to RL instead of just bandits, I can kind of understand it, but see my last point.

- As a general rule, I try to minimize math in undergrad-focused talks/documents. Even as someone who spends a lot of time explaining statistical concepts to people, my eyes glaze over when I see `q*(a) = E[R_t | A_t = a]`. Obviously you need some and this is just a personal thing. For the most part I actually think you do a decent job of explaining the equations you use. At least until the gradient bandit part :P Then it just feels like a textbook proof excerpt.

- Nit: You don't fully explain that epsilon-greedy is greedy, except for epsilon of the time, when it picks at random. That tripped me up for a second.

- The last thing is that I feel like the motivation and difference between stationary and nonstationary reward distributions isn't well explained. Nonstationary rewards don't really "fit" the mental model behind k-armed bandits a lot of the time. I'm actually curious for a better motivation there, as I can't articulate one myself.
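To make the regret framing and the epsilon-greedy nit concrete, here is a minimal sketch rather than anything taken from the article. It assumes Bernoulli arms with made-up means, and the names `run_bandit`, `epsilon_greedy`, and `thompson` are purely illustrative. It tracks cumulative expected regret (the gap between the best arm's mean and the mean of the arm actually played) for both strategies:

```python
import random


def run_bandit(policy, true_means, steps, seed=0):
    """Play a Bernoulli bandit and return cumulative expected regret."""
    rng = random.Random(seed)
    k = len(true_means)
    wins = [0] * k    # observed successes per arm
    losses = [0] * k  # observed failures per arm
    best = max(true_means)
    regret = 0.0
    for _ in range(steps):
        arm = policy(wins, losses, rng)
        if rng.random() < true_means[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        # Regret is measured against the best arm's true mean.
        regret += best - true_means[arm]
    return regret


def epsilon_greedy(epsilon):
    """Greedy on the empirical mean, except a random arm epsilon of the time."""
    def policy(wins, losses, rng):
        k = len(wins)
        if rng.random() < epsilon:
            return rng.randrange(k)  # explore: uniform random arm
        # Exploit: untried arms are scored 0.0 (an assumption of this sketch).
        means = [w / (w + l) if w + l else 0.0 for w, l in zip(wins, losses)]
        return max(range(k), key=means.__getitem__)
    return policy


def thompson(wins, losses, rng):
    """Sample each arm's mean from its Beta posterior and play the largest."""
    # Beta(wins + 1, losses + 1) is the posterior under a uniform prior.
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=samples.__getitem__)


if __name__ == "__main__":
    means = [0.3, 0.5, 0.7]  # hypothetical arm payout probabilities
    print("epsilon-greedy regret:", run_bandit(epsilon_greedy(0.1), means, 5000))
    print("Thompson regret:      ", run_bandit(thompson, means, 5000))
```

Epsilon-greedy keeps exploring at a fixed rate forever, while Thompson Sampling explores less as the posteriors tighten; how large the difference is on any given run depends on the means, the seed, and the horizon.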

1 comments

oneraynyday 2963 days ago

Hey Joshua, thanks so much for the criticism. I do see your point towards epsilon greedy vs. regret minimization. I will add a section about that before presenting UCB1. I also will add more information in the epsilon-greedy strategy section, elaborating on what the epsilon is really for. I'm not 100% sure how to reframe the nonstationary reward situation, because I feel like that adds state dependent on t to the bandit scenario, which then feels more like MDP.


joshuamorton 2963 days ago

>because I feel like that adds state dependent on t to the bandit scenario, which then feels more like MDP.

Yeah this was sort of exactly the issue I was running into. I can't justify it to myself without essentially saying "this is just an MDP in disguise", which maybe is the right way to do it. I'm pretty sure you can define a k-armed bandit as an MDP on a single state, where each action corresponds to a machine, and all actions return you to the single state.

So maybe that is the right motivation. But reversing it into "an MDP is just a k-armed bandit problem where sometimes playing a machine breaks it and forces you to play other machines, which can impact how quickly the casino fixes your first machine..." feels forced.
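For what it's worth, the single-state construction can be sketched in a few lines. This is only an illustration under my own assumptions (Bernoulli rewards, and a hypothetical `SingleStateMDP` class that isn't from the article): every action is an arm, and every transition leads back to state 0.

```python
import random
from dataclasses import dataclass


@dataclass
class SingleStateMDP:
    """A k-armed Bernoulli bandit written as an MDP with one state (state 0)."""
    arm_means: list  # hypothetical payout probability for each action/arm

    def step(self, action, rng):
        """Take an action; return (next_state, reward)."""
        reward = 1.0 if rng.random() < self.arm_means[action] else 0.0
        next_state = 0  # every action returns to the single state
        return next_state, reward


if __name__ == "__main__":
    rng = random.Random(0)
    mdp = SingleStateMDP([0.2, 0.8])
    state = 0
    for action in (0, 1, 1, 0):
        state, reward = mdp.step(action, rng)
        print(action, state, reward)
```

Because the next state never depends on the action, there is nothing to plan over, which is exactly why the bandit case is the degenerate one. A nonstationary bandit would need the reward distribution to depend on something beyond the action, and that is where it starts to look like extra state.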

All that said, it's a good article :)
