by bjoernbu 3206 days ago
Serious question by someone who's not into this kind of stats too much:

What would happen if you took a big set of Facebook profiles and trained a CNN (the same architecture each time, if you like) to classify picture->f for each f in the profile features? Surely, for some features, the model would deliver decent precision.

Does this mean that you would quickly find out which features can be predicted from pictures and how well your CNN performs on each? Or is it possible that you would just train models for picture->X where X is basically meaningless but significantly correlated with some feature, because of the effect portrayed in xkcd's "Significant" (Scientists investigate!) [1]?

[1]: https://xkcd.com/882/
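
To put a rough number on the worry above, here is a minimal sketch of the multiple-comparisons arithmetic. It assumes the tests are independent and that each uses a significance threshold alpha of 0.05; the function name `familywise_error_rate` is made up for illustration, and nothing here models a CNN.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is a simplification for real profile features.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# The more features you test, the more likely one looks "significant" by luck.
for num_tests in (1, 20, 100):
    print(num_tests, round(familywise_error_rate(num_tests), 3))
```

With 20 features, the chance of at least one spurious "significant" result is already about 64%, so a single hit among many picture->f models is weak evidence on its own.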

1 comment

There is a tendency for machine learning (including neural networks) to overfit the training data, i.e. the algorithm learns to recognise quirks of the particular samples rather than the features that genuinely distinguish the groups. As you say, these quirks can be features that are associated with what you are trying to discriminate purely by chance.

This is why a model is validated on a test set kept separate from the training set used to fit it: a chance association in the training data is unlikely to recur in data the model has never seen. There are lots of ways to do this, and the more sophisticated ones repeatedly iterate training and testing to improve the model.
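
A toy simulation may make this concrete. It is only a sketch under stated assumptions: instead of training a CNN, it picks the best of many purely random binary "features" by training accuracy, which is a much cruder stand-in for overfitting. The labels are random too, so there is nothing real to learn. All names and parameter values (`best_noise_feature`, `num_features`, and so on) are hypothetical.

```python
import random


def accuracy(feature, labels):
    """Fraction of samples where a binary feature equals the binary label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)


def best_noise_feature(num_features=500, num_train=40, num_test=4000, seed=0):
    """Select the random feature with the best training accuracy.

    Returns (training accuracy, held-out test accuracy) of that feature.
    Features and labels are independent coin flips, so any apparent
    predictive power on the training set is pure chance.
    """
    rng = random.Random(seed)
    train_labels = [rng.randint(0, 1) for _ in range(num_train)]
    test_labels = [rng.randint(0, 1) for _ in range(num_test)]

    best = None
    for _ in range(num_features):
        train_feature = [rng.randint(0, 1) for _ in range(num_train)]
        test_feature = [rng.randint(0, 1) for _ in range(num_test)]
        train_acc = accuracy(train_feature, train_labels)
        # Selection uses the training set only; the test set is never consulted.
        if best is None or train_acc > best[0]:
            best = (train_acc, accuracy(test_feature, test_labels))
    return best


train_acc, test_acc = best_noise_feature()
print(f"train accuracy of selected feature: {train_acc:.2f}")
print(f"test accuracy of selected feature:  {test_acc:.2f}")
```

The selected feature should look clearly better than a coin flip on the small training set and fall back to roughly 50% on the held-out data, which is the gap a separate test set is meant to expose.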