by marketforlemmas 3649 days ago
So I've read the PP piece and your article, and I think your criticism is way off-base. The most technical part of your argument rests on a p-value being .057 rather than .05, which is not a good argument. No one seriously believes that .05 is a magical number that separates true from false; results that land close to .05 but not below it are not automatically false.

They go on to give supporting analysis in the form of the false positive and false negative rates by race, which is pretty compelling evidence. You claim not to believe that because you can't find it in the notebook, but it's literally right underneath the Cox model section.

I was intrigued by this article and went a step further to plot the ROC curves, and the evidence is solid. It's messy, but you can see it here https://github.com/stoddardg/compas-analysis/blob/master/my_... in cell 78. It's quite clear that the algorithm is choosing a different operating point on the ROC curve for white people (a more lenient one) than for black people. A white defendant with a risk score of 5 is as likely to commit a crime as a black defendant with a score of 7. That's an obvious case where you could simply relabel and be more fair, but their algorithm chooses not to.
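For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how ROC points can be computed per score threshold. It is not the code from the linked notebook; the function name `roc_points` and the toy data are made up for illustration. Running it once per racial group and comparing where each group's cutoff lands on its curve is the comparison described above.

```python
def roc_points(scores, reoffended, thresholds):
    """Return [(threshold, fpr, tpr)], flagging someone as high risk when score >= threshold."""
    positives = sum(reoffended)
    negatives = len(reoffended) - positives
    points = []
    for t in thresholds:
        # True positives: flagged and did reoffend. False positives: flagged but did not.
        tp = sum(1 for s, y in zip(scores, reoffended) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, reoffended) if s >= t and not y)
        points.append((t, fp / negatives, tp / positives))
    return points


# Toy data (hypothetical, NOT the COMPAS dataset): one decile score per person.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
reoffended = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

points = roc_points(scores, reoffended, thresholds=range(1, 11))
for threshold, fpr, tpr in points:
    print(threshold, fpr, tpr)
```

With real data you would call `roc_points` separately on each group's rows and plot the two curves on the same axes.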

I also hate when people abuse bad statistics and reasoning to sell page views.


I have plenty of criticisms of hypothesis testing and p-values. Nevertheless, if you choose to run that type of analysis, do it right - this means sticking with your analysis and not using weasel words like "almost statistically significant" when it doesn't come out the way you want. Incidentally, the real p-value is 11.075% since they ran two hypothesis tests and didn't adjust for multiple comparisons.
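The 11.075% figure is consistent with a Šidák-style adjustment of p = 0.057 for two tests, which assumes the two tests are independent. The sketch below shows that arithmetic next to the more conservative Bonferroni bound; the function names are made up for illustration.

```python
def sidak_adjusted(p, n_tests):
    """Probability of at least one result this extreme across n independent tests."""
    return 1 - (1 - p) ** n_tests


def bonferroni_adjusted(p, n_tests):
    """Upper bound on the same probability, with no independence assumption."""
    return min(1.0, p * n_tests)


p = 0.057  # the p-value discussed in this thread
print(round(sidak_adjusted(p, 2) * 100, 3))       # 11.075
print(round(bonferroni_adjusted(p, 2) * 100, 3))  # 11.4
```

Either way, the adjusted value is roughly double the unadjusted one.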

Your analysis might be right - if so, that's interesting. I'll take a closer look and write a followup piece if it holds up. Among other things, a glance at your ROC curves suggests they are pretty close, performing better for whites in some regions and better for blacks in others. But it's 7:30AM (pre-coffee) and I haven't looked closely yet.

But since PP did not do any of this, my criticism of them holds - they ran an NHST, got the wrong result, and then spouted a bunch of anecdotes instead of admitting that their analysis went against what they wanted to find.

What's your response to the GP's point that you seem to have missed the part of the notebook where the false positive/negative rates are calculated? In your blog post, you said this:

> Finally, the article includes a table of false positive probabilities (FPP) and false negative probabilities (FNP). This may or may not be evidence of bias - the authors would need to run a statistical test to determine that, which they don't. In fact, I can't even find the place in their R notebook where they did that calculation. Is this the result of bad statistics? Is it merely random chance? Who knows!

Looking at PP's Jupyter Notebook, the calculations seem to be performed at lines 50 onwards (if you're referring to the table that I think you're referring to).

FWIW, those "weasel words" you allege are in the writeup of the methodology, where the audience is expected to follow along and see how the 0.057 is calculated. I'm not sure how you're interpreting that calculation... My read is that it's not the bedrock on which all of the other analyses are based. Where in the story do you see that particular calculation being used as the main (or even ancillary) thrust of the piece?

Aggregate false positive/false negative rates don't prove anything. Differences in them can be caused by composition differences between the groups, and the analysis itself demonstrates that such differences exist.

But I'm sure they look convincing to ProPublica's readers who are not statistically sophisticated.
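To make the composition point concrete, here is a small worked example with entirely hypothetical numbers (not taken from the COMPAS data). Both groups get the same reoffense probability for a given score, so the score treats individuals identically, yet the aggregate error rates differ because the groups have different mixes of low and high scores.

```python
# Hypothetical reoffense probabilities per score band, identical for both groups.
P_REOFFEND = {"low": 0.2, "high": 0.6}


def error_rates(n_low, n_high):
    """Expected (FPR, FNR) when a 'high' score is treated as a positive prediction."""
    fp = n_high * (1 - P_REOFFEND["high"])  # flagged high, did not reoffend
    tn = n_low * (1 - P_REOFFEND["low"])    # flagged low, did not reoffend
    fn = n_low * P_REOFFEND["low"]          # flagged low, did reoffend
    tp = n_high * P_REOFFEND["high"]        # flagged high, did reoffend
    return fp / (fp + tn), fn / (fn + tp)


# Group A is mostly low scores, group B is mostly high scores (made-up counts).
fpr_a, fnr_a = error_rates(n_low=80, n_high=20)
fpr_b, fnr_b = error_rates(n_low=40, n_high=60)

print(f"Group A: FPR={fpr_a:.3f} FNR={fnr_a:.3f}")
print(f"Group B: FPR={fpr_b:.3f} FNR={fnr_b:.3f}")
```

In this toy setup group B ends up with a higher false positive rate and a lower false negative rate than group A, purely from the difference in score mix.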

What's even more concerning is that these statistical noobs sometimes look to experts, unaware that these math geniuses may lack the literacy to read a plaintext notebook to the end.