by xyhopguy 2964 days ago
I think that's how you derive UCB, except that you optimize cumulative regret rather than finding the probability distribution directly. Please correct me if I'm wrong.
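For anyone who wants to see the regret framing concretely, here is a rough sketch of the textbook UCB1 rule. It is not taken from the thread. It assumes Bernoulli arms and the standard bonus sqrt(2 ln t / n_i), and the names `ucb1`, `true_means` and `horizon` are made up for illustration. It picks arms by an optimistic index and tracks cumulative (pseudo-)regret, without ever modeling a distribution over which arm is best.

```python
import math
import random


def ucb1(true_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms (hypothetical setup).

    Returns the pull count per arm and the cumulative pseudo-regret,
    i.e. the summed gap between the best mean and the chosen arm's mean.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(true_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # Index = empirical mean + exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.8], horizon=2000)
    print("pulls per arm:", counts)
    print("cumulative regret:", round(regret, 2))
```

The bonus shrinks as an arm gets pulled more often, so clearly worse arms get sampled less and less, which is what keeps the cumulative regret growing slowly in the known analyses of UCB1.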