by JoshCole 1477 days ago
Mike, even when you get the correct result using the right rules, the expected value of an envelope is not the expected value of a policy. I'm not agreeing with you that the problem is the expected value calculation. I'm telling you the problem is deeper than that. Fix the expected value calculation and you still have a paradox, because you are making a decision not on your expected value but on the expected value of the envelope. These are two different things, and a rational person shouldn't eliminate the dependency between their policy and the reward they get.

To try and stress to you how big a problem this is, pretend you were playing the game and you wanted to find the move that was going to make your EV the highest. Now, if you go with the logic of the problem, you are allowed to select not on the basis of your EV, but on the basis of some other EV. So instead of choosing the envelope, why not choose something else that has no relationship with your EV? Say, the weather in Alaska. If it is sunny, we like sunny. So switch. If it isn't sunny, we don't like that. So keep. It is crazy to do this, because there is no relationship between the EV you are using as a selection criterion and the EV your policy gets. This is the same situation as using the EV of the envelope. It sounds really crazy when you use the Alaska example, because it is so obviously unrelated. It sounds so reasonable when you use the envelope example, because it isn't as obvious that they are unrelated. Yet for the policy of P(switch)=1, the EV of the envelope and the EV of the policy with respect to the game are not the same thing.
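To make the gap concrete, here is a minimal sketch in Python. It assumes a hypothetical version of the game where the envelopes hold 100 and 200 and you are handed either one with probability 1/2; the function names are mine, not part of the problem statement. It compares the tempting per-envelope calculation with the exact value of a policy.

```python
from fractions import Fraction


def policy_value(p_switch, small=Fraction(100)):
    """Exact expected payoff of a policy that switches with probability p_switch.

    Assumption: the envelopes hold `small` and 2 * small, and the player is
    handed either one with probability 1/2.
    """
    half = Fraction(1, 2)
    total = Fraction(0)
    for held, other in ((small, 2 * small), (2 * small, small)):
        # Payoff of the policy for this particular deal.
        payoff = (1 - p_switch) * held + p_switch * other
        total += half * payoff
    return total


def naive_envelope_value(held):
    """The tempting calculation: the other envelope is 2 * held or held / 2."""
    return Fraction(1, 2) * (2 * held) + Fraction(1, 2) * (held / 2)


always_keep = policy_value(Fraction(0))
always_switch = policy_value(Fraction(1))
naive = naive_envelope_value(Fraction(100))

print(always_keep, always_switch, naive)
```

Under these assumptions both policies are worth exactly the same amount, while the per-envelope number says the other envelope is worth 1.25 times whatever you hold. The second number is not the value of any policy in this game.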

Now imagine the Wikipedia article for the Alaska problem variant of the two envelope problem. Do you really think everyone would be so focused on the EV of the envelope as the step that was wrong? How could they? We have the same paradox still, but there is no EV calculation for the envelope included in the problem. If we can remove the EV calculation, yet still have the same paradox, it seems to me the paradox is not the expected value calculation.

So what is my solution? Well, to actually find your correct policy function you need to get the argmax of the policy with respect to the game. There are multiple ways to do this:

- Reinforcement learning does it by finding argmax pi with respect to Q_pi(s, a) = R(s') + P(keep)Q_pi(s',keep) + P(switch) Q_pi(s', switch).

- Game theory sets it up a bit differently. You define a similar graph using a different formalism, but simplified to operate over information sets. You can use a thing called regret matching; basically it turns out that if you play in proportion to your normalized counterfactual regret, the average of those policies is the optimal best response.
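As a rough illustration of the second approach, here is a minimal regret matching sketch in Python. It is a toy, not the general solution: it assumes a single information set (you cannot tell which envelope you hold), a hypothetical fixed pair of 100 and 200, and it sidesteps the infinite graph entirely. All names are made up for the example.

```python
def regret_matching(cumulative_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total == 0.0:
        # No positive regret anywhere: fall back to the uniform policy.
        return [1.0 / len(positive)] * len(positive)
    return [p / total for p in positive]


def action_values(small=100.0):
    """Expected payoff of [keep, switch] at the single information set.

    Assumption: envelopes hold `small` and 2 * small, dealt with probability 1/2.
    """
    deals = [(small, 2 * small), (2 * small, small)]
    keep = sum(0.5 * held for held, _ in deals)
    switch = sum(0.5 * other for _, other in deals)
    return [keep, switch]


def train(iterations=1000):
    """Run regret matching and return the average policy over all iterations."""
    regrets = [0.0, 0.0]
    strategy_sum = [0.0, 0.0]
    values = action_values()
    for _ in range(iterations):
        strategy = regret_matching(regrets)
        node_value = sum(p * v for p, v in zip(strategy, values))
        for a in range(2):
            # Regret: what the action would have earned minus what we earned.
            regrets[a] += values[a] - node_value
            strategy_sum[a] += strategy[a]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]


average_policy = train()
print(average_policy)
```

In this toy setup neither action ever accumulates regret, so the average policy stays uniform. The procedure is scoring keep and switch by what the policy earns in the game, and on that score it finds nothing to prefer.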

In both cases you need to do something about the fact you're actually on an infinite graph. So the actual solution in the general case looks very different from their way of solving the problem. It isn't just simple probability; I mean, it is, but taking the limit of an infinite sequence and taking advantage of the properties of Markov chains aren't usually what I think of when someone tells me that something is simple probability. That is one formalism. In the other, we do have simple probability, but it isn't necessarily obvious that the central limit theorem gives us the optimal policy when we play in proportion to not just our regret, but our counterfactual regret. So yes, simple probability, but also, most people who know simple probability don't necessarily even know what a counterfactual is. So maybe not that simple after all.

But let's say we stick to the problem. We are here to learn how to avoid this problem, right? Nope. If you don't do things like this, you'll just be wrong in more complicated situations, because the EV of the envelope is not the EV of the game with respect to your policy. This gets increasingly true as your imperfect information games get more complicated; it is very true of complex real world situations. The value of a wallet with a hundred dollars in the real world is different depending on whether you got that wallet with a policy function of robbing people versus earning it at your work. I feel sticking to their formalism means you end up conceited with regard to your ability to protect yourself from this paradox, because you consider yourself a master of the expected value of the envelope, but you're still vulnerable to the paradox, because the expected value of the envelope isn't the expected value of the game with your policy. So sticking with the problem is the opposite of protecting yourself.

I'm so far from what they want the problem to focus on, but they are wrong to focus on that. They aren't protecting themselves from making the same mistake. They're dooming themselves to use the wrong tools for solving this problem. So they will make this mistake and they'll even be more confident in themselves as they do it, because they were clever and did the wrong thing in a better way, calculating the EV correctly, but staying within the land of paradox despite that.