Hacker News new | ask | show | jobs
by smallnamespace 2894 days ago
What's the tradeoff vs. just taking a direct Bayesian approach?

In fact, why use an inferential framework at all (estimating some sort of probability and using it to guide action), rather than directly using a policy learning framework, e.g. modeling this as Q-learning or multi-armed bandit problem?

If at the end of the day you have some objective function (e.g. 'making money'), some known space of actions (e.g. move this widget up the page, change the color, engage with user this way), and a reasonable way to associate those two, then isn't the company literally doing reinforcement learning over time?

It seems one benefit of a reinforcement learning framework is it maintains a set of actions that will still be explored in the future without forcing you to prematurely 'choose' whether A or B is actually better—if A is better in reality, then it will be explored more and more often and B will progressively become downweighted over time.

3 comments

> If at the end of the day you have some objective function

That "If" often evaluates to false.

There are tough judgement calls involved in selecting what is that metric that the org wants to optimize. It is very rare that business management commits to a clear quantitative goal. Reasons are many -- weasel room is important politically, selecting a metric that captures short term and long term goals is difficult, there is a lot of uncertainty in the costs due to uncertainty on how overhead should be billed etc etc.

This is fairly common. Typically, in these situations its the PMs who make the final call. There the goal of the experiment is to glean as much knowledge as possible, and present it to the PM. If that comes at the cost of exposing some customers to bad choices, so be it -- in other words, explore at the cost of losses in the opportunity to exploit.

> why use an inferential framework at all

Probably because of the maintenance cost of the code that was only explored but never exploited.

A policy learning approach is better imho - but getting people to switch to using a multi-armed bandit when they are used to AB testing can be difficult.

People don't seem to trust the system to make the right decisions even though you can do simulations and have the mathematics to show it is correct.