by oneraynyday
2960 days ago
Hey Joshua, thanks so much for the criticism. I do see your point about epsilon-greedy vs. regret minimization. I will add a section about that before presenting UCB1. I will also add more information to the epsilon-greedy strategy section, elaborating on what the epsilon is really for. I'm not 100% sure how to reframe the nonstationary reward situation, because I feel like that adds state dependent on t to the bandit scenario, which then feels more like an MDP.
|
Yeah, this was exactly the issue I was running into. I can't justify it to myself without essentially saying "this is just an MDP in disguise", which may be the right way to do it. I'm pretty sure you can define a k-armed bandit as an MDP with a single state, where each action corresponds to a machine and every action returns you to that state.
So maybe that is the right motivation. But reversing it into "an MDP is just a k-armed bandit problem where sometimes playing a machine breaks it and forces you to play other machines, which can impact how quickly the casino fixes your first machine..." feels forced.
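For what it's worth, here is a rough sketch of the single-state construction. The class name, the `step` signature, and the unit-variance Gaussian rewards are all assumptions made for illustration, not anything taken from the article:

```python
import random


class SingleStateBanditMDP:
    """A k-armed bandit written as an MDP with exactly one state.

    Hypothetical illustration: names and the reward model are assumptions.
    """

    STATE = 0  # the only state in the MDP

    def __init__(self, arm_means, seed=None):
        self.arm_means = list(arm_means)  # one mean reward per machine
        self.rng = random.Random(seed)

    @property
    def actions(self):
        # One action per machine.
        return range(len(self.arm_means))

    def step(self, state, action):
        # The transition is trivial: every action leads back to STATE.
        assert state == self.STATE
        # Assumed reward model: Gaussian with unit variance around the arm mean.
        reward = self.rng.gauss(self.arm_means[action], 1.0)
        return self.STATE, reward
```

Since `step` always returns the same state, the transition function carries no information and only the reward distributions matter, which is the stationary bandit setting. Letting `arm_means` change over time is where the extra state would sneak in.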
All that said, it's a good article :)