| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by smallnamespace 2894 days ago

What's the tradeoff vs. just taking a direct Bayesian approach?

In fact, why use an inferential framework at all (estimating some sort of probability and using it to guide action), rather than directly using a policy learning framework, e.g. modeling this as Q-learning or multi-armed bandit problem?

If at the end of the day you have some objective function (e.g. 'making money'), some known space of actions (e.g. move this widget up the page, change the color, engage with user this way), and a reasonable way to associate those two, then isn't the company literally doing reinforcement learning over time?

It seems one benefit of a reinforcement learning framework is it maintains a set of actions that will still be explored in the future without forcing you to prematurely 'choose' whether A or B is actually better—if A is better in reality, then it will be explored more and more often and B will progressively become downweighted over time.

3 comments

srean 2893 days ago

> If at the end of the day you have some objective function

That "If" often evaluates to false.

There are tough judgement calls involved in selecting what is that metric that the org wants to optimize. It is very rare that business management commits to a clear quantitative goal. Reasons are many -- weasel room is important politically, selecting a metric that captures short term and long term goals is difficult, there is a lot of uncertainty in the costs due to uncertainty on how overhead should be billed etc etc.

This is fairly common. Typically, in these situations its the PMs who make the final call. There the goal of the experiment is to glean as much knowledge as possible, and present it to the PM. If that comes at the cost of exposing some customers to bad choices, so be it -- in other words, explore at the cost of losses in the opportunity to exploit.

link

dmichulke 2894 days ago

> why use an inferential framework at all

Probably because of the maintenance cost of the code that was only explored but never exploited.

link

alexgmcm 2893 days ago

A policy learning approach is better imho - but getting people to switch to using a multi-armed bandit when they are used to AB testing can be difficult.

People don't seem to trust the system to make the right decisions even though you can do simulations and have the mathematics to show it is correct.

link