by brudgers 4420 days ago
Data gone Wrong

Did they start hanging out with the bad kids, take up cigarettes, drinking, gambling only to progress to crack and burglaries one of which ended with our Data shooting a home owner who returned unexpectedly?

I guess I don't understand what data is. I always thought it was a set of values. And I always thought that the problem when using data was in the interpretation, and that a prudent consumer of data would always be careful to distinguish between a random sample and a self-selecting sample when drawing conclusions, and then would only state conclusions couched in the language of statistical inference.

Leaving aside the question of why I should give a fuck about this supposed outrage, why does the author expect there to be a strong correlation between movie quality and the ratings on a website devoted to providing entertainment by having users rate movies?

When The Matrix is purported to be a better movie than Lawrence of Arabia, the problems of interpretation are systemic.

2 comments

Everything you say is right, of course. Yet, I upvoted the story.

I thought it was indicative of a larger trend where crowdsourced data are used to illustrate a point. Like the Google flu trends articles, which have gone around HN at least twice, once when they were successful (https://news.ycombinator.com/item?id=5040204) and once when they were critiqued (e.g., https://news.ycombinator.com/item?id=7455307).

I work a lot with sampled data, and I have found that sampling issues can be some of the most difficult to appreciate and to quantify -- even for experts.

I guess it comes down to sampling from one distribution, P(x), when the situation you really care about samples according to a different distribution P'(x). If P is far from P', your conclusions from P can be arbitrarily bad. If you have an adversary moving P around deliberately, as here, it's even worse.
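To make the P versus P' point concrete, here is a rough sketch in Python. All the numbers are made up for illustration: `p_voters` stands in for P (the self-selected people who actually rate), `p_audience` stands in for P' (the population you really care about), and the 1-to-5 rating scale is an assumption, not anything taken from the article.

```python
# Hypothetical rating scale and two made-up distributions over it.
ratings = [1, 2, 3, 4, 5]
p_voters = [0.60, 0.10, 0.05, 0.05, 0.20]    # P: who actually votes
p_audience = [0.05, 0.15, 0.40, 0.30, 0.10]  # P': who you care about


def mean_under(dist, values):
    """Expected value of `values` when drawn according to `dist`."""
    return sum(p * v for p, v in zip(dist, values))


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


mean_voters = mean_under(p_voters, ratings)      # what the site reports
mean_audience = mean_under(p_audience, ratings)  # what you wanted to know
gap = total_variation(p_voters, p_audience)      # how far P is from P'

print(f"mean under P:  {mean_voters:.2f}")
print(f"mean under P': {mean_audience:.2f}")
print(f"total variation distance: {gap:.2f}")
```

With these invented numbers the voters' average lands more than a full point below the audience's, and nothing in the voter data alone tells you that. An adversary who can reshape `p_voters` can push the reported mean pretty much wherever they like.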

Statistics experts are fewer and further between than experts in other fields who use statistics to justify their decisions, and the article shows how far off base most people are. After all, the author conducted a numerical analysis of the database, presents the findings as facts about data, and includes a rough statistical comparison of the voting patterns of the lowest rated [called 'worst'] and the second lowest rated movies.

If there is an interesting statistical result it's that the movie's rating is entirely consistent with crowd sourced predictions. The theory is that 'wisdom of crowds' results directly from diversity among those making predictions.[1] In the case of the lowest rated movie, those making predictions were unusually homogeneous, and therefore an inaccurate prediction as to the quality is unsurprising.
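For anyone who wants to poke at it, the theorem says that the crowd's squared error equals the average individual squared error minus the diversity of the predictions (the variance of the predictions around the crowd mean). Below is a small Python sketch of that identity. The function name, the "true" quality of 6.0, and both sets of ratings are invented for illustration and are not from the article.

```python
def diversity_decomposition(predictions, truth):
    """Return (collective error, average individual error, diversity).

    The Diversity Prediction Theorem states that
    collective error = average individual error - diversity.
    """
    n = len(predictions)
    crowd = sum(predictions) / n  # the crowd's prediction is the mean
    collective_error = (crowd - truth) ** 2
    avg_individual_error = sum((s - truth) ** 2 for s in predictions) / n
    diversity = sum((s - crowd) ** 2 for s in predictions) / n
    return collective_error, avg_individual_error, diversity


truth = 6.0                           # hypothetical "real" quality
diverse = [2.0, 5.0, 7.0, 10.0]       # raters who disagree a lot
homogeneous = [1.0, 1.0, 2.0, 2.0]    # raters who all vote alike

for name, crowd in [("diverse", diverse), ("homogeneous", homogeneous)]:
    ce, ae, d = diversity_decomposition(crowd, truth)
    print(f"{name}: collective={ce:.2f} individual={ae:.2f} diversity={d:.2f}")
```

In this toy case the diverse raters are individually far off, yet their errors cancel in the mean. The homogeneous raters have almost no diversity to subtract, so the crowd is about as wrong as its typical member.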

Again, it's all in the interpretation, e.g. there's statistical evidence that a lot of morons ranked The Matrix.

[1] Diversity Prediction Theorem: http://vserver1.cscs.lsa.umich.edu/~spage/ONLINECOURSE/predi...

When Data Bites Back. Good points, dude.