by ZephyrBlu 1907 days ago
> But nobody knows why the issue occurs

They do know why that occurs. It's because the data set is biased.


No, you don't "know" that your dataset is biased until you perform the statistical analysis explicitly. It might be that your neural net has a non-uniform weight distribution in some dimension (e.g. in time, or in the ordering of the training data), so dismissing any unwanted results by claiming "your dataset is biased" is a form of appeal to (artificial) authority.
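For what it's worth, one explicit check of the kind described above is a chi-square goodness-of-fit test of the dataset's composition against a reference population. This is only a minimal sketch: the group names, counts, and reference proportions are made up for illustration, and the test only covers composition, not the weight or ordering effects mentioned.

```python
def chi_square_gof(observed_counts, expected_props):
    """Pearson chi-square goodness-of-fit statistic.

    observed_counts: dict mapping category -> count seen in the dataset
    expected_props:  dict mapping category -> proportion in the reference population
    """
    total = sum(observed_counts.values())
    stat = 0.0
    for category, prop in expected_props.items():
        expected = total * prop
        observed = observed_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical numbers, purely illustrative.
observed = {"group_a": 700, "group_b": 200, "group_c": 100}
reference = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}

stat = chi_square_gof(observed, reference)

# Critical value of the chi-square distribution for 2 degrees of freedom
# (3 categories - 1) at the 0.05 significance level.
CRITICAL_DF2_ALPHA_05 = 5.991
looks_skewed = stat > CRITICAL_DF2_ALPHA_05
```

A statistic above the critical value only says the sample's composition differs from the chosen reference. Whether that reference is the right one is a separate assumption.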
It's not an appeal to artificial authority. It's a very likely root cause and comes with a solution even: get a different data set or adjust your existing data set. Your response rings of No true Scotsman to me since you can argue any analysis is not rigorous enough or doesn't cover all potential issues.

edit: And a statistical analysis isn't some sort of magic data genie. Statistics can give rigorous results because it makes strong assumptions. If those assumptions don't hold then the results aren't rigorous anymore. A trillion-parameter model can pull interactions out of your data that almost no statistical analysis of the data would identify ahead of time. So what you need to do is analyze the model, try to infer why it's predicting certain results differently, and then work backwards from there.
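As a rough sketch of what "analyze the model and work backwards" could look like at its simplest: compare an error rate across groups on held-out data. The record layout `(group, y_true, y_pred)` and the group labels here are hypothetical, not taken from any particular system.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Compute the false positive rate per group.

    records: iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    Groups with no true negatives are omitted from the result.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:
            negatives[group] += 1
            if y_pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / n for g, n in negatives.items() if n > 0}


# Made-up predictions for two hypothetical groups.
records = [
    ("a", 0, 0), ("a", 0, 0), ("a", 0, 1), ("a", 0, 0), ("a", 1, 1),
    ("b", 0, 1), ("b", 0, 1), ("b", 0, 0), ("b", 0, 0), ("b", 1, 1),
]
rates = false_positive_rate_by_group(records)
```

A gap between groups doesn't tell you why it exists. It only tells you where to start digging.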

Is it a case of BAME people having less stable families, being poorer, and committing more crimes, and therefore being more represented in the data, which leads to even more incarcerations?
First: I don't think you can claim that without also doing some very rigorous statistics. I'm not asking you to, but if you're going to base policy on that statement rather than merely arguing on the internet, you'd need to.

Second: even if you do, you're going to have a hard time controlling for the fact that the police and criminal justice system have a long history of disproportionately enforcing the law against people of color. The base data about who commits crimes, gets convicted, etc. for well over a hundred years is going to reflect this bias.

I'm not claiming to have done my homework here either, same disclaimer applies. I suspect you could find somebody who does study this if you wanted to look.

All of the above assumes we're discussing the US, btw.

Funny, I remember thinking the questions were biased, every time they weren't the ones I studied to answer in my exam preparations.

Too bad I wasn't a data scientist or else I could just get a passing grade by claiming the questions were chosen from a biased data set, or retake the exam until the data set matched the questions I studied for, at which point the data set would no longer be biased, lol.

Funny line of work, this data 'science' where you only use the results that fit the narrative you wanted in the first place.

We're in full doublethink mode, just keep repeating data 'science', 'science', 'science'. :)

I did a geography O-level (the UK exams for 16-year-olds at the time) which included a map-reading exercise. The map just so happened to cover an area a couple of miles from where I lived, and I knew it well. Still only got a B though.