Hacker News new | ask | show | jobs
by sohdas 1509 days ago
In your golf example, making that guess requires an additional knowledge of what "pro" means and it's frequency among golfers. The data doesn't know that just like the randomness data doesn't know that most humans are younger than 65 years old. If you really want to figure out how predictive the data is, you shouldn't include considerations like that in your model. I get what you're saying but ultimately I don't think their goal was to make the most accurate prediction, they wanted to make one that illustrated their point by basing their guess off the data alone.
1 comments

The calculation involves knowing the age of the sample population though (if you don’t know the ages of your sample, how do you work out what the cut off is at 60 years?).

If I don’t know how many golfers are pro, I simply cannot estimate if it is 100 golfers that are pro or 0 (unless it’s a real gap in scores). Making an assumption that 50 are pro is no more valid than 0 or 100.

If you take the average score of 100 people and say that you estimate anyone scoring below the average is above 60, you are going to be wrong regardless of if your hypothesis is valid or not.

Putting that up and saying “see, it’s wrong 50% of the time!” doesn’t make sense when your calculation is incorrect.

In order to calculate the cut-off correctly they either need to take the 95th percentile result, or pick a sample where 50% of people are over-60 and 50% are under 60 and take an average of that.

Using a dataset where 95% of people are under 60 and then picking the mean clearly isn’t going to work.