by daenz 2437 days ago
I'm not sure I follow their car crash diagram and explanation. They've laid out that one ethnicity might prefer red cars more than others, and drivers of red cars tend to get into more crashes, and that training ML with "red cars" as a feature would lead to a bias against that ethnicity. I got that part. What I don't get is how the creation of the "risky behavior" node can be assumed to have a completely uniform distribution of ethnicities inside of it. The author has no problem saying that an ethnicity can have one causal behavior (purchasing red cars) but not another (being riskier drivers). This seems logically inconsistent.
4 comments

There is a strong push for "fairness", see e.g. "Toronto Declaration". I think all it would do is completely halt progress of AI and install bureaucracy to the lowest decision levels, paralyzing whole ML research. Nobody seems to consider that we are in a clash of different cultures with different sensitivities, and there is no single common platform for stating what is "fair". I am worried the loudest voice will set the trend and we will have some insanity enforced all the way down. There are even calls to ban "blackbox" ML, which would basically allow only trivial models in any kind of decision making.

If members of my nation get drunk more often than those of some other nation, then while it's offensive to say I am a 34% drunkard, on average it might hold; instead of forbidding this type of inference, I'd rather rely on more signals to figure out what kind of person I am specifically, for individualized decisions. They bypass this problem by adding "risky behavior", which is not contained in the input dataset, so they just decide to model it as a hidden variable in Bayesian inference, where "risky behavior" might be correlated with ethnicity and red cars anyway, just not visibly from the outside. So if my nation is 34% drunkards but the neighboring one is only 11%, the conditional probability will likely be higher for my nation anyway, only obfuscated by the use of a Bayesian hidden state. I am not sure why that would improve fairness.
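
To make the hidden-variable point concrete, here is a minimal sketch in Python. The 34% and 11% rates are the ones used in the comment above; the outcome rates given the hidden trait (0.30 and 0.05) and all names are invented for illustration and are not taken from the article. Summing out the hidden node still leaves a group-level gap in the predicted outcome:

```python
# Hypothetical model: group -> hidden trait -> outcome.
# Trait rates per group come from the comment above; the outcome
# rates given the trait are made-up numbers for illustration only.
p_trait_given_group = {"A": 0.34, "B": 0.11}
p_outcome_given_trait = {True: 0.30, False: 0.05}


def p_outcome(group):
    """P(outcome | group), obtained by marginalizing over the hidden trait."""
    p_t = p_trait_given_group[group]
    return (p_t * p_outcome_given_trait[True]
            + (1.0 - p_t) * p_outcome_given_trait[False])


for g in p_trait_given_group:
    print(g, round(p_outcome(g), 4))
```

Under these assumed numbers the predicted rate for group A stays higher than for group B, even though group membership never touches the outcome directly in the graph.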

> There is a strong push for "fairness", see e.g. "Toronto Declaration". I think all it would do is completely halt progress of AI and install bureaucracy to the lowest decision levels, paralyzing whole ML research.

It would only paralyze those who paid attention to the Toronto Declaration. You're right, though: you can't make ML fair, because the universe isn't fair. Fairness is a property of human judgements about facts, and the facts remain the same regardless of one's feelings.

https://www.chrisstucchio.com/pubs/slides/crunchconf_2018/sl...

AI Ethics, Impossibility Theorems and Tradeoffs

Except no two humans have matching ideas about what's fair, which means that each is unfair from the other's perspective.

Humans are in reality much less fair than algorithms.

> there is no single common platform for stating what is "fair".

This is the crux of the issue and as always, most people seem to miss it. Often “fair” is used as shorthand for “does what I think is right”.

"forbidding this type of inference"

Isn't this just a misleading way to say "holding a certain causal belief"? Why exactly would that be a bad thing? If you reject one set of causal beliefs, you necessarily hold a different set.

Some beliefs are correlated with reality, others aren't. If GP's assertion about 34% more drinking on average is true, then rejecting it isn't "holding a different set of beliefs", it's just being wrong.

If there's an issue worth pursuing here, it's educating people to stop using average population statistics to rate individuals from populations. Usually the variance within a population makes population-level statistics useless for evaluating individuals.
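
A rough illustration of the variance point, assuming two normally distributed populations whose means differ by 0.2 standard deviations. The numbers and names are made up; only the standard library's `statistics.NormalDist` is used:

```python
from statistics import NormalDist

# Two hypothetical populations: same spread, slightly different means.
group_a = NormalDist(mu=0.0, sigma=1.0)
group_b = NormalDist(mu=0.2, sigma=1.0)

# Overlapping coefficient: shared area under the two density curves (0..1).
overlap = group_a.overlap(group_b)

# Distribution of (B individual - A individual) for independent draws,
# then the chance that a random B individual scores above a random A one.
diff = group_b - group_a
p_b_exceeds_a = 1.0 - diff.cdf(0.0)

print(f"overlap of the two distributions: {overlap:.3f}")
print(f"P(random B individual > random A individual): {p_b_exceeds_a:.3f}")
```

With these assumed parameters the overlap is about 0.92 and the pairwise probability about 0.56, so knowing someone's group says very little about where that individual falls.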

Rejecting the causal relationship is not the same as rejecting the correlation, right? Why can't (or shouldn't) one separate the two?

You're right in principle, but the point here is about the reasons for rejecting a causal model. The issue people seeking fairness in statistics run into is rejecting models based on what ought to be, instead of what is. A causal model can be totally unfair, and yet also correct (insofar as an approximation is considered correct).

Taking the example from our parallel discussion, if the data says being male is correlated with risky driving, and it seems to fit the causal model of "male -> risky", it would be wrong to reject it just on the grounds of "we're using this model to set insurance rates, so by penalizing males, the model is sexist". It may be that you can come up with a better causal model explaining the correlation - say, cultural history and path dependence - but until you can, rejecting a fitting model based on "it's unfair, reality ought not to be so" is just wrong.

> What I don't get is how the creation of the "risky behavior" node can be assumed to have a completely uniform distribution of ethnicities inside of it.

It's a much broader problem than that, because the direction of causation can be extraordinarily difficult to establish in general.

Changing the color of your car shouldn't change your ethnicity, but what if it does? Suppose you're white with Spanish ancestry and Hispanics are the group who like red cars. Paint your car red and some red-car-preferring Hispanics may be more inclined to associate with you and thereby cause you to be more immersed in Hispanic culture and start to identify as Hispanic rather than white.

And that's a silly one just to show that even the exemplar could be wrong. More plausibly, what if the causation between "risky behavior" and "red car" is reversed? We know that colors can affect human behavior. If getting into a red car makes you drive more aggressively then you have a direct causal chain between being more likely to buy a red car (for any reason) and being more likely to drive aggressively and get into a car crash.

That means that in order to use this you would first need to prove the direction of causation between the two behaviors. But that's a tall hill to climb when one of the factors you're trying to prove causation with is the one you don't have good data on.

There is also a straightforward way to tell when a method like this is definitely getting the math wrong -- does it make the prediction rate for that class of people worse? If your assumptions are correct then it shouldn't, so if it does then you've unambiguously failed.

Right, it seems very plausible that car culture differs between cultures. Is it truly unreasonable to suggest that perhaps more than an average number of Italians are fast, aggressive drivers? From what I've heard and seen, it's anecdotally true. I wouldn't rule out the possibility of its being statistically true.

And every time I express my desire for autobahns without speed restrictions to crisscross North America, whoever I'm talking to has generally been quick to inform me that Germans can have nice things like that because they are careful/skilled drivers, while Americans are reckless (wreckful) drivers and cannot be trusted at high speeds.

If more than an average number of Italians are fast drivers, it doesn't mean being Italian causes being a fast driver. Is the idea that correlation is not causation in this context really breaking everybody's brain?

Now you may argue that correlation reflects causation in a particular case, sure, but in general, it is not the same, so it seems perfectly logical to me to point out that you can start building your model with certain causal assumptions and without others, without in any way disregarding your statistics.

Is it so hard for you to believe there might plausibly be a causation?

Consider the case of African Americans who are discriminated against by traffic cops. Is it plausible that African Americans, in an attempt [perhaps in vain] to minimize interaction with traffic cops, are more cognizant of traffic laws and drive more conservatively than the average American? I don't know if the data supports that hypothetical, but it seems plausible to me.

Assuming that this were the case, if you were to assume that African Americans drove as well as white Americans, you would be discriminating against the African American population by failing to recognize their safer driving habits.

Whether you or I think there is a causation in specific cases is irrelevant, as is whether we apply charged terms like "racism" to certain causal linkages.

The point is that one is not compelled to believe in the causal link just because there is a statistical link.

So if certain causal links are politically contentious, rejecting them due to "political correctness" is completely separate from rejecting the facts, the statistics that are collected. It is political, but not in opposition to reality.

The article, as I understood it, is puncturing the assertion of objectivity by those who implicitly assert we have to regard all correlations as equally causal or else be against reason and logic.

I certainly do not believe anybody should be compelled to assume a correlation is a causation. However I also do not think one should preemptively rule out the possibility of causation. Without examining the nitty-gritty details of any particular situation, we can't know which is the case. We certainly cannot assume one and rule out the other, which I fear is what you assumed I was doing.

The elephant in the room is that the real way to tell whether that is the case would be to use race as a factor, the same as age or sex. If African Americans are more careful drivers then that would detect it and take it into account.

But then you have to take the bad with the good. If it turns out that strict adherence to traffic laws that nobody else abides by is actually more dangerous than following the normal flow of traffic, it would also detect that and take it into account.

Well, it may be the case that they accidentally have a proxy for race already in their data (the "this ethnicity prefers red cars" hypothetical in the article above). So race may already in practice be factored in despite nobody intending for that to be the case (assuming nobody anticipated that a particular metric is a racial proxy). That does not necessarily mean it's being unfair to that race though. It could, hypothetically, mean that it's actually being fair to that race, advantaging them in a system/society that would otherwise disadvantage them.

The whole "proxy for race" thing is such a mess.

The original problem was that racists were not just taking race into account but were disproportionately penalizing certain races. They would literally just refuse to do business with black people. And then once that was made illegal, they would refuse to do business with people from black neighborhoods (redlining), i.e. use location as a proxy for race so they could continue to refuse to do business with black people under that pretext.

Normal Bayesian statistics doesn't do that because it's missing the actually racist piece of it, which is giving disproportionate weight to race (or something that correlates with race) so that you refuse disproportionately many people of a particular race for no legitimate reason.

The unfairness never came from taking into account some factor that correlates with race, or even race itself to the extent that it actually correlates with outcomes. It came from using a factor to deny service even though it didn't correlate with outcomes, or if it did then still not proportionately to the huge negative weight assigned to it. It came from giving race, or a proxy for race, disproportionate weight. Giving it proportionate weight isn't unfair, it's the only thing that is fair.

"then that would detect it and take it into account."

You have a method for automatically deriving causal relationships from correlational data?

A lack of a causal relationship wouldn't matter in that case. If something correlates with the outcome then it allows you to better predict the outcome even if it isn't the cause, because it at least correlates with the cause or it wouldn't correlate with the outcome.

Though obviously if it isn't the cause then you're better off taking into account the true cause rather than only the thing that correlates with it -- which would cause the correlation with the outcome to disappear for the non-causal factor when you take into account both.
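
A small sketch of that last claim, computed exactly from an assumed joint distribution rather than from data. The graph (risky -> red car, risky -> crash) follows the article's example as described in this thread; every probability below is invented for illustration:

```python
from itertools import product

# Hypothetical parameters: "risky" is the true cause of both
# red-car ownership and crashes; red cars cause nothing themselves.
P_RISKY = 0.3
P_RED_GIVEN_RISKY = {True: 0.6, False: 0.2}
P_CRASH_GIVEN_RISKY = {True: 0.30, False: 0.05}


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given value."""
    return p if value else 1.0 - p


# Exact joint distribution over (risky, red, crash).
joint = {
    (risky, red, crash): bern(P_RISKY, risky)
    * bern(P_RED_GIVEN_RISKY[risky], red)
    * bern(P_CRASH_GIVEN_RISKY[risky], crash)
    for risky, red, crash in product([True, False], repeat=3)
}


def p_crash(red, risky=None):
    """P(crash | red) or, if risky is given, P(crash | red, risky)."""
    rows = [(k, v) for k, v in joint.items()
            if k[1] == red and (risky is None or k[0] == risky)]
    total = sum(v for _, v in rows)
    return sum(v for k, v in rows if k[2]) / total


# Red cars predict crashes when the true cause is ignored...
marginal_gap = p_crash(True) - p_crash(False)
# ...but not once the true cause is taken into account.
gap_risky = p_crash(True, risky=True) - p_crash(False, risky=True)
gap_safe = p_crash(True, risky=False) - p_crash(False, risky=False)

print(f"gap ignoring risky behavior: {marginal_gap:.4f}")
print(f"gap among risky drivers:     {gap_risky:.4f}")
print(f"gap among careful drivers:   {gap_safe:.4f}")
```

The catch, as noted elsewhere in the thread, is that this only works if the true cause is actually observed; when it is hidden, the proxy keeps its predictive power.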

I think the two behaviors should be understood as arbitrary for illustrative purposes. The point is, as I understand it, that you can decide that one causal relationship exists and another does not, and derive a model consistent with that and with the observed statistics.

Because, as people constantly pay lip service to but never seem to really adhere to, correlation is different from causation.