Hacker News new | ask | show | jobs
by JoshCole 1425 days ago
If you apply the cognitive biases model to algorithms which have superhuman performance in various games - like AlphaZero, DeepBlue, Pluribus, and so on - the natural result is to conclude that these models are predictably irrational. The reason you get this conclusion is because it turns out to be necessary to trade off theoretical optimal answers for the sake of speed. The behavioral economic view of human irrationality ought to be considered kind of dumb in view of that result. But it is actually so much worse than that for the field, because the math shows that sacrificing optimality for speed would be something that even an infinitely fast computational intelligence would be forced to do. It isn't irrational; it is a fundamentally necessary tradeoff. In imperfect information games your strategy space is continuous, EV is a function of policy, and many games even have continuous action spaces. If you thought Go was high branching factor you thought wrong; Go is an example of a freakishly low branching factor. It is infinitely smaller than the branching factor in relatively trivial decision problems.

If you've never looked at cognitive biases through the lens of performance optimization you should try it. What seems like an arbitrary list from the bias perspective becomes clever approximative techniques in the performance optimization perspective.

I often think about why this isn't more commonly known among people who call themselves rationalists and tend to spend a lot of time discussing cognitive bias. They seem to be trending toward a belief that general superintelligence is of infinite power, doubling down on their fallacious and hubristic appreciation for the power of intelligence.

I say this, because when you apply the algorithms that don't have these biases - the behavioral economist view wouldn't find them to be irrational since they stick to the math, they follow things like the coherence principles for how we ought to work with probabilities as seen in works by Jayne, Finett, and so on - they either don't terminate, or, if you force them to do so... well... they lose to humans; even humans who aren't very good at the task.

4 comments

>why this isn't more commonly known among people who call themselves rationalists

because most of these people do nothing else but writing blogs about rationalism. Same reason university tests are sometimes so removed from practicality compared to evaluation criteria in business, the people who make them do nothing else but write these tests.

I suspect if you put some rationalists into the trenches in the Donbass for a week they'd quickly have a more balanced view of what's needed to solve a problem besides rational contemplation.

The thing about continuous space solutions is that they are typically differentiable, which means you can use a gradient descent or LM optimization rather than needing to fully explore the solution space. Typically there are large regions which are heuristically excludable, which is what you are getting at I think, but even an unbiased sampling plus gradient descent often makes problems much more tractable than discrete problems.
The type of learning problem where I agree with your point is in something like learning how to classify hand written digits. My point about the continuous nature being unsearchable in practice is about recursive forms - if I choose this policy, my opponent will choose to react to the fact that I had that policy.

In your learning problem where thing were made tractable by differentiation you have something like an elevation map that you are following, but in the multi-stage decision problem you have something more like a fractal elevation map. When you want to know the value of a particular point on the elevation map you have to look for the highest point or the lowest point on the elevation map you get by zooming in on the area which is the resultant of your having chosen a particular policy.

The problem is that since this is a multi-agent environment they can react to your policy choice. So they can for example choose to have you get a high value only if you have the correct password entered on a form. That elevation map is designed to be a plain everywhere and another fractal zoom corresponding with a high utility or a low error term only at the point where you enter the right password.

Choose a random point and you aren't going to have any information about what the password was. The optimization process won't help you. So you have to search. One way to do that is to do a random search; if you do that you eventually find a differing elevation - assuming one exists. But what if there were two passwords - one takes you to a low elevation fractal world that corresponds with a low reward because it is a honeypot. The other takes you to the fractal zoom where the elevation map is conditioned on you having root access to the system.

This argument shows us that we actually would need to search over every point to get the best answer possible. Yet if we do that we have to search over the entire continuous distribution for our policy. Since by definition there are an infinite number of states a computer with infinite search speed can't enumerate them; there is another infinite fractal under every policy choice that also needs full enumeration. We have non-termination by a diagonalization argument for a computer that has infinite speed.

Now observe that in our reality passwords exist. Less extreme - notice that reacting to policy choice in general, for example, moving out of the way of a car that drives toward you but not changing the way you would walk if it doesn't, isn't actually an unusual property in decision problems. It is normal.

I get what you're saying about recursive adversarial problems and their fractal nature, but this is exactly what GANs do to great success, despite the fact that it's hard. Yes, they have to train a lot slower, but learning general strategies and patterns in opponent behaviour still works.

Your password example on the other hand is a discrete, non-differentiable example. If it was differentiable - for example instead of a true/false you got an edit distance to the real password, then passwords would be trivial to crack.

I am taking about decision problems, you are taking about learning problems. These are different. Skip past the idea that you need to learn something. You’ve finished doing so.

What happens once we learn an approximation of that landscape; a map that has error, it doesn’t correspond fully with the territory.

The cognitive bias framing calls the map biased, but if you generalize from that to a more global sense of irrationality the reasoning is in error. In a more particular situation you have a simpler game tree because it is just the game tree under the node. The lifting of constraints produces the ability to have further insight - the map has to be an approximation.

Don’t reach for edit distance; make the boolean a Maybe Boolean which needs further resolution. See that the approximation is demanded because the world isn’t setup to allow all things to be learnable. My honeypot example is simpler than reality - there exists passwords for which trying to guess the password but getting the honeypot resolves to the learner being jailed; generally the learner in the actual game wouldn’t even get to have infinite guesses either, but I made the problem simpler to expose the problem complexity in terms that learning theory would be more familiar with - the elevation maps of the error landscape that learners like to slide down.

Decision problems are a subset of learning problems. As soon as someone can simulate your environment there is no negative consequence to further exploring the solution space via differentiable evaluation methods which allow efficiently training an optimal player.
Your intuitions are steering you wrong. Think about this from first principles in light of some of the corrections I'm going to provide:

> which allow efficiently training an optimal player.

Training an optimal player is not possible in practice. We know and have known the mathematics for optimal play for decades. Since we know it we are able to calculate the amount of space such a solution would take up in memory. Again this is a studied thing. Here is Peter Norvig in Artificial Intelligence: A Modern Approach to tell you the same thing: Page 173. "Because calculating optimal decisions in complex games is intractable, all algorithms must make some assumptions and approximations."

> Decision problems are a subset of learning problems.

This framing has some benefits - it makes generalization simpler. It has some downsides too - in complicated environments it will only approximate the solution and because of that there will be times where it gets things wrong.

In theory you have at first an intractable problem at your initial training time. Then when the game begins and play has progresses you have a more tractable problem because the information available to you eliminates parts of the game tree from consideration. The result of this is that we actually have two learning problems - not one learning problem. One is computed prior to the game. The other is computed during the game.

This theoretical issue has been studied and found to exist in practice by DeepMind. They tried training agents that didn't use tree search and just used the learned heuristic. These lost to agents that also used tree search.

Here is a section from a talk by Noam Brown - he briefly covers your intuition and why it breaks down.

1. https://youtu.be/cn8Sld4xQjg?t=2241

Here is another talk in which he goes over the research results of DeepMind and shows the data which stands against your thesis:

2. https://youtu.be/cn8Sld4xQjg?t=1886

This is also something you can see without reference to theory by looking at the physical progress on optimal solutions. Chess solving for example has the solutions via the end game tables, but they only have them for the more specific instances you reach near the end of the game tree. It is widely understood that we don't have enough memory to store the full solution to the game.

> As soon as someone can simulate your environment there is no negative consequence

This is a non-physical claim. There is obviously a cost to computation. It consumes both energy and time. Our best understanding is that we have a finite amount of these. Your theoretical approach isn't physically real.

> As soon as someone can simulate your environment...

It doesn't become easy at this point. It remains intractable.

A very simple example of why it doesn't get easy is the halting problem from computer science.

A more complicated example that you will have to really think about in order to understand is the nature of the equilibrium adversarial strategy. It is defined with a respect to an oracle - something which would be able to perfectly simulate its strategy. And it is trying to not lose to an oracle; it is assuming you have a very good map.

You've got to remember - your simulation is your map - it isn't the territory. When you play, you aren't playing on your map. You are playing in the territory via your map. The equilibrium strategies were already assuming you had a map. So they aren't trying to make it easy for your map to give you the right answer. They are trying to make some places un-mappable.

Again - remember the real world. Do I know your password? Why not? And what is my password, if it is so easy to know it?

Only if the local optima are good.
That's a matter of designing a good cost function.
> apply the cognitive biases model to algorithms which have superhuman performance in various games

Could you give an example of this?

- Anthropic bias

The algorithms have this tendency. They use counterfactual reasoning to determine that assuming a nash player alike to them is their opponent when making their decisions. Sometimes they don't have a nash opponent, but they persist in this assumption anyway. In the cognitive bias framing this tendency is error. In the game theoretic framing this corresponds with minimizing the degree to which you would be exploited. You can find times where the algorithm plays against something that isn't nash and so it was operating according to a flawed model. You can call it biased for assuming that others operated according to that flawed model. From a complexity perspective this assumption lets you drop an infinite number of continuous strategy distributions from consideration - with strong theoretical backing for why it won't hurt you to do so - since nash is optimal according to some important metrics.

- Attentional bias

The tendency to pay attention to some things and not other things. Some examples of times where we do that are with alpha beta pruning. You can find moves that involve sacrifice that show the existence of this bias. The conceit in the cognitive bias framing is that it is stupid because some of the things might be important. The justification is that it some things are more promising than others and we have limited computational budget. Better to stop exploring the things which are not promising since they are not promising and direct efforts to where they are promising. Something like an upper confidence bound tree search in the cognitive bias model would turn balancing the explore exploit dynamic as part of approximating the nash equillibrium into erroneous reasoning because it doesn't choose to explore everything is an example of the lesser form of anchoring effects as they relate to attentional bias. It weights the action values according to the promising rollout more highly.

- Apophenia

Hashing techniques are used to reduce dimensionality. There is an error term here but you gain faster reasoning speed. Seen in blueprint abstraction - the poker example I gave - since we've hashing down using similarity to help bucket similar things. This gives rise to things like selective attention (another bias, and kind of related to this general category of bias).

Jumping ahead to something like confirmation bias the heuristic that all these algorithms are using are flawed in various ways. They see that they are flawed after a node expansion and update their beliefs, but they don't update the heuristic. In fact if a flawed heuristic was working well such that it won we would have greater confidence rather than lesser confidence in the bias.

---

Putting all that aside I would caution against specifity in understanding my point. I think approaching it in this direction - very specific examples - is horrible because it directs attention to the wrong things; when you look at specific examples you're always in a more specific situation and if you're in a more specific situation it means that your situation is more computationally tractable than the general situation which was being handled by the algorithm. So trying to focus on examples is actually going to give you weird inversions where the rules that applied in general don't apply to the specific situation.

You need to come about it from the opposite direction - from the problem descriptions to the necessary constraints on your solution. Then it happens that the error in reasoning is a natural result of trying to do well.

It sound like you're talking about, or at least brushing up against, prudential judgement[0]. Sometimes, the optimal move is not to seek the optimum.

An obvious class of problems is where determining the optimum takes more time than the lifetime of the problem. Say you need to write an algorithm at work that does X, and you need X by tomorrow. If it would take you a week to find the theoretical optimum, then the optimum in a "global" sense is to deliver the best you can within the constraints, not the abstract theoretical optimum. The time to produce the solution is part of the total cost. An imprudent person would either say it's not possible, or never deliver the solution in time.

[0] https://www.newadvent.org/cathen/12517b.htm

Yeah, that is pretty close to what I'm talking about. Coming at it from a different perspective - learning theory - but it seems to be the same overarching idea. I'm extending it a little though to something similar to anachronistic reasoning being incorrect - you can't divorce prudential decisions from their context. When you do judgement of the decisions is flawed because it doesn't acknowledge the actual constraints the decision was made under.