by Closi 1514 days ago
Yeah, but their guess shouldn't be wrong 50% of the time, as again that means they can’t have picked the 95th percentile result! Because it’s 50:50, I’ll assume that they are assigning everyone scoring higher than average to the “under 60” category, which is obviously incorrect. Otherwise, how do they pick the cut-off?

To explain with another example - let's say that I have a dataset of 100 people's scores at golf (no handicaps) and I know that 5% of them are pro players and the others are 'advanced amateurs'. Because of this I might take the top 5 scores, guess that they are pros, and assign the others the guess of 'advanced amateur'.

Now let's say that there was actually no correlation between people's scores at golf and their 'pro' status - what accuracy would I expect in the above experiment? The answer is actually closer to 90% 'accurate guesses' than 50%! (Although obviously that's 90% accurate purely by random chance.)

Now if someone told me they got 50% of the guesses wrong at this task, that implies that they guessed that the top 50% of those golfers were pro rather than picking the top 5% of scores, and I would question the methodology.
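
To put numbers on the golf example, here is a rough sketch of the expected accuracy when the score tells you nothing about pro status. The function name and parameters are made up for illustration; the only inputs are the figures from the example above (100 golfers, 5 pros).

```python
from fractions import Fraction

def expected_accuracy(n_people: int, n_positive: int, n_guessed: int) -> Fraction:
    """Expected share of correct labels when n_guessed people are labelled
    'positive' (pro) and the score carries no information about pro status."""
    base_rate = Fraction(n_positive, n_people)
    # Each person guessed 'pro' really is a pro with probability base_rate.
    expected_true_pos = n_guessed * base_rate
    # Each person guessed 'amateur' really is one with probability 1 - base_rate.
    expected_true_neg = (n_people - n_guessed) * (1 - base_rate)
    return (expected_true_pos + expected_true_neg) / n_people

top5 = expected_accuracy(100, 5, 5)    # flag the top 5 scores as pros
top50 = expected_accuracy(100, 5, 50)  # flag the top half as pros

print(float(top5))   # 0.905
print(float(top50))  # 0.5
```

So with no correlation at all, flagging the top 5 is right about 90% of the time, while flagging the top half is right exactly half the time.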

This percentage is similar to the dataset on the webpage - I downloaded it, filtered out the exclusions, and c. 4% of the valid responses are from people aged 60 or over.

If I inherently pick a small population (i.e. over-60s are c. 4% of this dataset) and I am guessing wrong 50% of the time, it means that my cut-off is incorrectly calibrated. Their score cut-off should, at worst, be picking the wrong 4% and missing another 4%, i.e. about 8% wrong in total.
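
The worst-case arithmetic can be sketched like this, assuming the rule flags exactly as many people as there are over-60s in the data. The function name is hypothetical, and the 4-in-100 figure is just the rounded share quoted above.

```python
def worst_case_error(n_people: int, n_positive: int) -> float:
    """Worst-case share of wrong labels when exactly n_positive people are flagged."""
    # Worst case: every flagged person is a false positive...
    false_pos = min(n_positive, n_people - n_positive)
    # ...which means every true positive was missed as well.
    false_neg = false_pos
    return (false_pos + false_neg) / n_people

print(worst_case_error(100, 4))  # 0.08
```

Even a cut-off that picks entirely the wrong people stays far below 50% wrong, as long as it only flags a group the size of the real minority.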

Am I going crazy? It seems logical to me, but to be open, maths isn't my strong point. I just know that if I designed the guessing rule, I would be getting more than 50% right. My algorithm would be: if the user's average score across the three tests is less than -1.5, assign 'over 60'. That would get c. 95% accurate guesses, albeit it would still not prove anything, and I agree with the authors' overall premise!

1 comments

In your golf example, making that guess requires additional knowledge of what "pro" means and its frequency among golfers. The data doesn't know that, just like the randomness data doesn't know that most humans are younger than 65 years old. If you really want to figure out how predictive the data is, you shouldn't include considerations like that in your model. I get what you're saying, but ultimately I don't think their goal was to make the most accurate prediction; they wanted to make one that illustrated their point by basing their guess off the data alone.
The calculation involves knowing the age of the sample population though (if you don’t know the ages of your sample, how do you work out what the cut off is at 60 years?).

If I don’t know how many golfers are pro, I simply cannot estimate whether it is 100 golfers that are pro or 0 (unless there’s a real gap in the scores). Making an assumption that 50 are pro is no more valid than 0 or 100.

If you take the average score of 100 people and say that you estimate anyone scoring below the average is above 60, you are going to be wrong regardless of whether your hypothesis is valid.

Putting that up and saying “see, it’s wrong 50% of the time!” doesn’t make sense when your calculation is incorrect.

In order to calculate the cut-off correctly they either need to take the 95th percentile result, or pick a sample where 50% of people are over 60 and 50% are under 60 and take the average of that.

Using a dataset where 95% of people are under 60 and then picking the mean clearly isn’t going to work.
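
Here is a quick simulation of that point, as a sketch only. The data is synthetic (not the dataset from the webpage), the 5% over-60 share is an assumption matching the rough figures in this thread, and the scores are deliberately pure noise so that neither cut-off has any real signal to work with.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N = 10_000
OVER_60_SHARE = 0.05  # assumed share of over-60s, for illustration only

# Synthetic people: the score is drawn independently of the age group.
is_over_60 = [random.random() < OVER_60_SHARE for _ in range(N)]
scores = [random.gauss(0, 1) for _ in range(N)]

def accuracy(cutoff: float) -> float:
    """Share labelled correctly by the rule 'score below cutoff => over 60'."""
    guesses = [s < cutoff for s in scores]
    return sum(g == truth for g, truth in zip(guesses, is_over_60)) / N

# Cut-off at the mean: flags roughly half the sample as over 60.
mean_cutoff = statistics.mean(scores)
# Cut-off that flags only the lowest 5% of scores (the percentile approach).
pct_cutoff = sorted(scores)[int(N * OVER_60_SHARE)]

acc_mean = accuracy(mean_cutoff)
acc_pct = accuracy(pct_cutoff)
print(f"mean cut-off accuracy:       {acc_mean:.3f}")
print(f"percentile cut-off accuracy: {acc_pct:.3f}")
```

The mean cut-off should land near 50% and the percentile cut-off near 90%, even though the scores carry no information in either case. The gap comes from the choice of cut-off, not from the data.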