| HN Mirror

While a contextual bandit could be learning using more parameters about the environment than the multi-armed bandit posed in the example, it still has the goal of finding the best reward given the specific domain (while by and large using the same strategy as the multi armed bandit). But placed in a different context, the state space and underlying model parameters could be completely different, so the previous reward for any given state, or sequence of states, could be irrelevant.

The goal here is to reward the agent for the search strategy they employed to arrive at their answer, not the quality of the answer itself.

One possible use case (directly related to their example with multi-armed bandits, possibly learnt by a contextual bandit but requires a good deal more modeling) could be retail pricing, where different categories of products have drastically different demand curves. A meta-algorithm has the promise of generalizing better and rapidly arriving at the optimal pricing across a wide range of similar price curves.