by JoshCole 1483 days ago
Honestly, the more I've thought about this problem the more that statement bothers me. It is just such a nonsense goal.

The reason the calculations are wrong is fundamentally related to why correcting the calculations leads to the correct probabilities. If you understand why the problem gives you the wrong result, you proceed to correcting it, and you get the right result - that doesn't mean you didn't understand. It means you did. You can tell you did, because you have the correct answer.

But if I show the calculations produce the correct answers, well, that is a step too far. We can now dismiss the understanding on the basis that it got the correct answer, apparently?

It's no wonder the author of the wiki writes such a bullshit claim - that no one can agree on a definitive conclusion. Their terms of engagement are self-defeating. Correctness is error, because correctness means you didn't /really/ understand. It's a no true Scotsman fallacy - get the right answer, and you aren't engaging with the 'real' problem.

But it's a flawed no true Scotsman, because it defines something measurable: it tells us the goal, which is to avoid this type of mistake in our thinking. There are many generalized algorithms for solving decision problems that handle this case successfully - and they just so happen to be in the imperfect information setting. That setting has studied the problem of infinite recursion and has solutions for it. Which I can apply. To get the right answer. And, in general, avoid the problems they claim we aim to avoid.

More importantly and to the point of this thread - this isn't an intractable debate, because it's been solved and used in production settings for literally decades. So what if some people are going to pretend it isn't? This isn't an unsolved problem. The actual game theory math is /well/ beyond this level of complexity. It's contending with things like environments where you have so much complexity you have to reduce to a blueprint abstraction, not stumbling at a decision problem that is quite literally simpler than rock paper scissors.

1 comments

You seem to misunderstand the point of that constraint. Correction is not an error or a "step too far," it's just insufficient.

You can arrive at the correct conclusion either by pinpointing exactly where the original argument is incorrect, or you can come up with a completely different argument that does not have an error.

The puzzle challenges you to pinpoint the error because coming up with the correct solution is trivial (and the puzzle is deliberately set up this way).

This is not a "nonsense goal." If this came up in real mathematical research - two papers coming to contradictory conclusions - and no one could find where either paper went wrong, we would have a real paradox on our hands.

I think this, because I disagree with most people here about what the actual paradox is.

I think the paradox is that the algorithm equates the expected value of the contents of an envelope with the expected value of a policy choice for a player. When I correct what I feel is the root of the paradox, my solution differs so fundamentally that the problem's restriction to pointing out the wrong step feels disingenuous.

The entire structure is wrong, because even if you do correct the error that leads to the wrong EV for the envelope, you still haven't resolved the paradox. The right probabilities don't resolve the paradox, because they still imply that always switching has the same EV as not switching. If they were really equal, I could always choose switch, but I can't - so the paradox is still there.

My resolution ends up being so critical of their argument that the entire way they go about solving gets thrown out. I end up seeing, not just a specific wrong EV calculation, but a decision problem that is just fundamentally using an inappropriate algorithm to determine the policy function.

With all due respect ... this is basic probability theory. It's not really controversial what the solution is. The article's failures are mainly pedagogical.

We can agree that "the entire structure is wrong" because the "entire structure" is giving a wrong formula for the EV and saying "this is the formula for the EV."
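
For anyone following along, here is a minimal sketch of the gap between the two numbers. It assumes the "wrong formula" is the usual switching argument, 1/2 * 2A + 1/2 * A/2 = 5A/4, and that the envelopes hold x and 2x. The function names are made up for this illustration.

```python
from fractions import Fraction

def naive_other_value(a):
    """The usual switching argument: treat 2a and a/2 as equally likely."""
    return Fraction(1, 2) * (2 * a) + Fraction(1, 2) * (a / 2)

def actual_other_value(x):
    """Average contents of the other envelope when the pair is (x, 2x)."""
    # Holding x means the other is 2x; holding 2x means the other is x.
    return Fraction(1, 2) * (2 * x) + Fraction(1, 2) * x

x = Fraction(1)
held_values = (x, 2 * x)  # the two amounts you might be holding, each with probability 1/2
naive_average = sum(naive_other_value(a) for a in held_values) / 2
print(naive_average, actual_other_value(x))  # prints: 15/8 3/2
```

The naive formula overstates the other envelope by a factor of 5/4 on average, because it ignores that the amount you hold already tells you which case you are in once the pair is fixed.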

Yes, switching and not switching have the same EV and you can always switch.
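
The equal-EV claim can be checked by brute force. This is a rough sketch, assuming the standard setup (one envelope holds x, the other 2x, the first pick is uniform, and the decision to switch is made without looking inside). `policy_ev` and `p_switch` are hypothetical names, not anything from the article.

```python
from fractions import Fraction

def policy_ev(p_switch, x=Fraction(1)):
    """Exact expected payoff of switching with probability p_switch.

    One envelope holds x and the other 2x. The first envelope is picked
    uniformly at random, and the switch decision ignores its contents.
    """
    ev = Fraction(0)
    for held, other in ((x, 2 * x), (2 * x, x)):
        # Each ordering of the pair has probability 1/2.
        ev += Fraction(1, 2) * ((1 - p_switch) * held + p_switch * other)
    return ev

for p in (Fraction(0), Fraction(1, 2), Fraction(1)):
    print(p, policy_ev(p))  # every line ends in 3/2
```

Under these assumptions every value of `p_switch` gives 3x/2, so never switching, always switching, and anything in between are all tied.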

> We can agree that "the entire structure is wrong" because the "entire structure" is giving a wrong formula for the EV and saying "this is the formula for the EV."

We aren't in agreement about this. I realize we have to fix this, but fixing it doesn't resolve the paradox. It is a red herring.

> Yes, switching and not switching have the same EV and you can always switch.

With all due respect, this isn't true and asserting this doesn't resolve the paradox. See my other reply for why it doesn't resolve the paradox.

Mike, even when you get the correct result using the right rules the expected value of an envelope is not the expected value of a policy. I'm not agreeing with you that the problem is the expected value calculation. I'm telling you the problem is deeper than that. Fix the expected value calculation and you still have a paradox, because you are making a decision not on your expected value but on the expected value of the envelope. These are two different things and a rational person shouldn't eliminate the dependency between their policy and the reward they get.

To try and stress to you how big a problem this is, pretend you were playing the game and you wanted to find the right move that was going to make your EV the highest. Now, if you go with the logic of the problem, you are allowed to select not on the basis of your EV, but on the basis of some other EV. So instead of choosing the envelope, why not choose something else that has no relationship with our EV? Say, the weather in Alaska. If it is sunny, we like sunny. So switch. If it isn't sunny, we don't like that. So keep. It is crazy to do this, because there is no relationship between the EV you are using as a selection criterion and the EV your policy gets. This is the same situation as using the EV of the envelope. It sounds really crazy when you use the Alaska example, because it is so obviously unrelated. It sounds so reasonable when you use the envelope example, because it isn't as obvious that they are unrelated. Yet for the policy of P(switch)=1, the EV of the envelope and the EV of the policy with respect to the game are not the same thing.

Now imagine the Wikipedia article for the Alaska problem variant of the two envelope problem. Do you really think everyone would be so focused on the EV of the envelope as the step that was wrong? How could they? We have the same paradox still, but there is no EV calculation for the envelope included in the problem. If we can remove the EV calculation, yet still have the same paradox, it seems to me the paradox is not the expected value calculation.

So what is my solution? Well, to actually find your correct policy function you need to get the argmax of the policy with respect to the game. There are multiple ways to do this:

- Reinforcement learning does it by finding argmax pi with respect to Q_pi(s, a) = R(s') + P(keep) Q_pi(s', keep) + P(switch) Q_pi(s', switch).

- Game theory sets it up a bit differently. You define a similar graph using a different formalism, but simplified to operate over information sets. You can use a thing called regret matching; basically it turns out that if you play in proportion to your normalized counterfactual regret, the average of those policies is the optimal best response.
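
To make the second bullet concrete, here is a rough sketch of plain regret matching on a single keep/switch information set. It is not full counterfactual regret minimization and not anyone's production code. It assumes the envelopes hold x and 2x, that chance is sampled once per iteration, and that the player cannot see the amounts. `strategy_from_regrets` and `train` are names made up for this example.

```python
import random

ACTIONS = ("keep", "switch")

def strategy_from_regrets(regrets):
    """Regret matching: play in proportion to positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        # No positive regret yet: fall back to the uniform strategy.
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]

def train(iterations, x=1.0, seed=0):
    """Run regret matching for iterations >= 1 rounds; return the average strategy."""
    rng = random.Random(seed)
    regrets = [0.0, 0.0]
    strategy_sum = [0.0, 0.0]
    for _ in range(iterations):
        strategy = strategy_from_regrets(regrets)
        # Chance node: which of the two envelopes the player is holding.
        held, other = rng.choice([(x, 2 * x), (2 * x, x)])
        utilities = [held, other]  # payoff of keep, payoff of switch
        node_value = sum(s * u for s, u in zip(strategy, utilities))
        for a in range(len(ACTIONS)):
            # Regret: how much better action a would have done than the current mix.
            regrets[a] += utilities[a] - node_value
            strategy_sum[a] += strategy[a]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

print(dict(zip(ACTIONS, train(10_000))))
```

Under these assumptions both actions have the same expected payoff, so any mix is a best response and the printed average is not expected to settle on a particular value. The sketch only shows the mechanics of accumulating regret and averaging the strategies.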

In both cases you need to do something about the fact you're actually on an infinite graph. So the actual solution in the general case looks very very different from their way of solving the problem. It isn't just simple probability; I mean, it is, but taking the limit of an infinite sequence and taking advantage of the properties of Markov chains aren't usually what I think of when someone tells me that something is simple probability. That is one formalism. In the other, we do have simple probability, but it isn't necessarily obvious that the central limit theorem gives us the optimal policy when we play in proportion to not just our regret, but our counterfactual regret. So yes, simple probability, but also, most people who know simple probability don't necessarily even know what a counterfactual is. So maybe not that simple after all.

But let's say we stick to the problem. We are here to learn how to avoid this problem, right? Nope. If you don't do things like this, you'll just be wrong in more complicated situations. Because the EV of the envelope is not the EV of the game with respect to your policy. This gets increasingly true as your imperfect information games get more complicated; it is very true of complex real world situations. The value of a wallet with a hundred dollars in the real world is different depending on whether you got that wallet with a policy function of robbing people versus earning it at your work. I feel sticking to their formalism means you end up conceited with regard to your ability to protect yourself from this paradox, because you consider yourself a master of the expected value of the envelope, but you're still vulnerable to the paradox, because the expected value of the envelope isn't the expected value of the game with your policy. So sticking with the problem is the opposite of protecting yourself.

I'm so far from what they want the problem to focus on, but they are wrong to focus on that. They aren't protecting themselves from making the same mistake. They're dooming themselves to use the wrong tools for solving this problem. So they will make this mistake and they'll even be more confident in themselves as they do it, because they were clever and did the wrong thing in a better way, calculating the EV correctly, but staying within the land of paradox despite that.