Hacker News new | ask | show | jobs
by orasis 2643 days ago
This seems to be a contextual bandit where the previous reward is included in the context.

I can’t come up with real world examples where the behavior of the reward function is changing like this to warrant making different decisions based on a previous reward.

1 comments

While a contextual bandit could be learning using more parameters about the environment than the multi-armed bandit posed in the example, it still has the goal of finding the best reward given the specific domain (while by and large using the same strategy as the multi armed bandit). But placed in a different context, the state space and underlying model parameters could be completely different, so the previous reward for any given state, or sequence of states, could be irrelevant.

The goal here is to reward the agent for the search strategy they employed to arrive at their answer, not the quality of the answer itself.

One possible use case (directly related to their example with multi-armed bandits, possibly learnt by a contextual bandit but requires a good deal more modeling) could be retail pricing, where different categories of products have drastically different demand curves. A meta-algorithm has the promise of generalizing better and rapidly arriving at the optimal pricing across a wide range of similar price curves.