Hacker News
by duchenne 2551 days ago
Overfitting typically happens when the number of training samples is small compared to the number of parameters of the model. It especially happens when the input space is high-dimensional.

Here, the input space has only one dimension, the model has 10 parameters, and there are 8,500 samples. So the author can safely assume that no overfitting occurs.
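A rough way to check this kind of claim is to compare the fit on the training samples with the fit on held-out samples. The sketch below assumes the model is a categorical distribution over the numbers 1 to 10 (one probability per number); that is a guess at what the 10 parameters are. The `weights` list is made-up illustrative data, not the article's Reddit data, and the function names are hypothetical.

```python
import math
import random

NUMBERS = range(1, 11)

def fit_categorical(samples, k=10, alpha=1.0):
    """Estimate one probability per number, with add-alpha smoothing."""
    counts = [alpha] * k
    for s in samples:
        counts[s - 1] += 1
    total = sum(counts)
    return [c / total for c in counts]

def avg_nll(probs, samples):
    """Average negative log-likelihood of the samples under probs."""
    return -sum(math.log(probs[s - 1]) for s in samples) / len(samples)

def train_test_gap(n_samples, seed=0):
    """Held-out loss minus training loss for a simulated group of pickers."""
    rng = random.Random(seed)
    # Hypothetical picking bias with a peak at 7 (NOT the real data).
    weights = [3, 8, 12, 10, 9, 8, 28, 10, 7, 5]
    data = rng.choices(NUMBERS, weights=weights, k=n_samples)
    split = n_samples // 2
    train, test = data[:split], data[split:]
    probs = fit_categorical(train)
    return avg_nll(probs, test) - avg_nll(probs, train)

if __name__ == "__main__":
    print("gap with 8500 samples:", train_test_gap(8500))
    small = [train_test_gap(20, seed=s) for s in range(200)]
    print("mean gap with 20 samples:", sum(small) / len(small))
```

Under these assumptions the gap with 8,500 samples should be close to zero, while the average gap with only 20 samples should be noticeably larger, which is the n versus p argument in miniature.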

2 comments

Wrong. You are assuming a good sample of the whole space of humans. If you sampled HN instead, I bet you would get a different distribution. The Reddit data only applies to Redditors sampled with a similar strategy; the generalization is limited by how general the collection strategy was.
Yes, there is a possibility that the sample is biased. But that does not matter for this experiment.

The OP says that the model should be built from people in the same room as him, and that the human RNG should then be run with those same people.

As a consequence, OP has to capture the bias of the people in his room to make his RNG work.

The model would break only if the people in the sample change their strategy to pick random numbers after some time.

The Reddit data is just used as an example of how to build a human RNG from Redditors' random picks.

As long as the training and testing are done from the same group of people, there is no issue, even if this group is biased.

Moreover, I don't see any prior reason to believe that Redditors and HN readers would have different biases. That would be an interesting experiment, though. However, the predominance of the number 7 in the article's data might be a Western culture thing.

This doesn't exclude overfitting. You can have a number of samples n much larger than the number of parameters p in, for example, a political or household earnings survey conducted in the San Francisco Bay Area, but obviously one cannot safely assume that no overfitting occurs in that scenario.
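The disagreement in this thread can be sketched with a small simulation: fit on one group, then evaluate on fresh samples from the same group and on samples from a different group. Everything here is an assumption for illustration: the two bias profiles `group_a` and `group_b` are invented, the model is assumed to be a 10-way categorical distribution, and the function names are hypothetical.

```python
import math
import random

NUMBERS = range(1, 11)

def fit_categorical(samples, k=10):
    """Estimate one probability per number, with add-one smoothing."""
    counts = [1.0] * k
    for s in samples:
        counts[s - 1] += 1
    total = sum(counts)
    return [c / total for c in counts]

def avg_nll(probs, samples):
    """Average negative log-likelihood of the samples under probs."""
    return -sum(math.log(probs[s - 1]) for s in samples) / len(samples)

def shift_experiment(n=8500, seed=0):
    """Return (train loss, same-group loss, other-group loss)."""
    rng = random.Random(seed)
    group_a = [3, 8, 12, 10, 9, 8, 28, 10, 7, 5]  # hypothetical: favors 7
    group_b = [10] * 10                            # hypothetical: no bias
    train = rng.choices(NUMBERS, weights=group_a, k=n)
    same_group = rng.choices(NUMBERS, weights=group_a, k=n)
    other_group = rng.choices(NUMBERS, weights=group_b, k=n)
    probs = fit_categorical(train)
    return (avg_nll(probs, train),
            avg_nll(probs, same_group),
            avg_nll(probs, other_group))

if __name__ == "__main__":
    train_loss, same_loss, other_loss = shift_experiment()
    print("train:", train_loss)
    print("same group, new samples:", same_loss)
    print("different group:", other_loss)
```

With these made-up profiles, the loss on new samples from the same group should stay close to the training loss even though the group is biased, while the loss on the other group should be clearly worse. That second gap comes from the change of population, not from the ratio of samples to parameters.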