by YeGoblynQueenne 1436 days ago
>> This leads to one of the key questions of deep learning, currently: Why do neural networks prefer solutions that generalize to unseen data, rather than settling on solutions which simply memorize the training data without actually learning anything?

It's the researchers who prefer these solutions, not the networks. And that's how the networks find them: the experimenters have access to the test data and they keep tuning their networks' parameters until they perfectly fit not only the training, but also the _test_ data.

In that sense the testing data is not "unseen". The neural net doesn't "see" it during training but the researchers do and they can try to improve the network's performance on it, because they control everything about how the network is trained, when it stops training etc etc.

It's nothing to do with loss functions and the answers are not in the maths. It's good, old researcher bias and it has to be controlled by other means, namely, rigorous design _and description_ of experiments.
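To illustrate the selection effect, here is a minimal sketch under stated assumptions: the labels are pure noise (so nothing can be learned), the "models" are random linear classifiers that are never trained, and the only thing the "researcher" does is keep whichever candidate scores best on the test set. All names (`make_noise_data`, `select_on_test_set`, etc.) are hypothetical and made up for this example.

```python
import random

def make_noise_data(n, d, rng):
    """Features are Gaussian; labels are coin flips, independent of the features."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y

def accuracy(w, X, y):
    """Accuracy of the linear classifier sign(w . x) on (X, y)."""
    correct = 0
    for x, label in zip(X, y):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        correct += (pred == label)
    return correct / len(y)

def select_on_test_set(n_candidates=200, n_test=100, n_fresh=2000, d=20, seed=0):
    """Pick the candidate with the best test accuracy, then score it on fresh data."""
    rng = random.Random(seed)
    X_test, y_test = make_noise_data(n_test, d, rng)
    X_fresh, y_fresh = make_noise_data(n_fresh, d, rng)
    # Each candidate stands in for one "tuning run"; none of them is trained.
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(d)]
                  for _ in range(n_candidates)]
    best = max(candidates, key=lambda w: accuracy(w, X_test, y_test))
    return accuracy(best, X_test, y_test), accuracy(best, X_fresh, y_fresh)

test_acc, fresh_acc = select_on_test_set()
print(f"accuracy on the test set used for selection: {test_acc:.2f}")
print(f"accuracy on fresh data from the same source: {fresh_acc:.2f}")
```

Since the labels carry no signal, every candidate's true accuracy is 50%. The selected candidate should still look clearly better than chance on the test set it was selected on, and fall back to roughly chance on the fresh sample.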


As an addendum, note that training for a competition does not eliminate this overfitting to the test set. Most competitions make the test set instances available, though not their labels. Many restrict the number of submissions that can be made, but they usually accept several. There's even a bit of jargon for gaming this: "hill climbing the test set" (really, it's hill climbing the performance on the test set, i.e. it's the accuracy on the test set that's optimised). Here's an actual how-to:

https://blockgeni.com/how-to-hill-climb-the-test-set-for-mac...

One benchmark I know where the test set is completely hidden is François Chollet's ARC dataset, and that's done precisely to preclude overfitting to the test set.
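For readers who want to see the hill-climbing idea from this addendum in miniature, here is a rough sketch. It is not the procedure from the linked how-to; it assumes a binary-label test set and a hypothetical leaderboard (`make_oracle`) that reveals nothing but the accuracy of each submission. No model and no training data are involved at all.

```python
import random

def make_oracle(hidden_labels):
    """Return a scorer that reveals only accuracy, like a competition leaderboard."""
    def score(predictions):
        correct = sum(p == y for p, y in zip(predictions, hidden_labels))
        return correct / len(hidden_labels)
    return score

def hill_climb_test_set(score, n_items, budget, rng):
    """Flip one prediction per submission; keep the flip only if the score improves."""
    predictions = [rng.randint(0, 1) for _ in range(n_items)]
    best = score(predictions)          # the initial random guess is one submission
    history = [best]
    order = list(range(n_items))
    rng.shuffle(order)
    for i in order[:budget]:           # one extra submission per visited item
        predictions[i] ^= 1
        new = score(predictions)
        if new > best:
            best = new                 # the flip fixed a wrong label: keep it
        else:
            predictions[i] ^= 1        # the flip broke a right label: undo it
        history.append(best)
    return best, history

label_rng = random.Random(0)
hidden = [label_rng.randint(0, 1) for _ in range(200)]
leaderboard = make_oracle(hidden)

limited, _ = hill_climb_test_set(leaderboard, 200, budget=50, rng=random.Random(1))
full, _ = hill_climb_test_set(leaderboard, 200, budget=200, rng=random.Random(1))
print(f"after 50 extra submissions:  {limited:.2f}")
print(f"after 200 extra submissions: {full:.2f}")
```

With one submission per test item the score reaches 100% without anything having been learned, which is why submission limits only slow this down, and why hiding the test instances as well as the labels is the stronger safeguard.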

This explanation is insufficient. Even if it explains good performance on the test data, it does not explain the typically good performance on real world data never seen during the training process.

It appears (somewhat) generalizing models are easier to compute than models that do not generalize at all.

>> Even if it explains good performance on the test data, it does not explain the typically good performance on real world data never seen during the training process.

You'll have to clarify this because I'm not sure what you mean by "real world data". Do you mean e.g. data that is made available after a machine learning system is deployed "live"?

As far as I can tell, nobody really does this kind of "extrinsic" evaluation systematically, first of all because it is very expensive: such "real world data" is unlabelled, and must be labelled before the evaluation.

What's more, the "real world data" is very likely to change between deployments of a machine learning system so any evaluation of a model trained last month may not be valid this month.

So this is all basically very expensive in terms of both money and effort (so, money), and so nobody does it. Instead everyone relies on the approximation of real-world performance on their already labelled datasets.

Yeah I agree that was badly phrased. I meant data points not used in any aspect of the training procedure, such as a photo taken after the image recognition model has been trained.

It's widely recognized that image recognition models typically perform well on such data too. We don't need to quantify that exactly to conclude that many large (in terms of parameters) models generalize quite well to data in neither the training nor the test set.

Provided that the model space is large enough to contain both models that generalize well and models that don't (while still fitting the training data), some explanation why we tend to find generalizing models is required.
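A toy sketch of that premise, with loud caveats: this is a one-dimensional problem, not a neural network, and the two "models" (`fit_memorizer`, `fit_threshold`) are hypothetical stand-ins chosen for illustration. It only shows that two models can both fit the training data perfectly and still differ completely on new points; it says nothing about why training tends to land on one kind rather than the other.

```python
import random

def true_label(x):
    """Assumed ground truth for the toy problem: positive numbers are class 1."""
    return 1 if x > 0 else 0

def fit_memorizer(xs, ys):
    """Lookup table over the training points; anything unseen gets class 0."""
    table = dict(zip(xs, ys))
    return lambda x: table.get(x, 0)

def fit_threshold(xs, ys):
    """Threshold halfway between the largest class-0 and smallest class-1 example."""
    lo = max(x for x, y in zip(xs, ys) if y == 0)
    hi = min(x for x, y in zip(xs, ys) if y == 1)
    thr = (lo + hi) / 2
    return lambda x: 1 if x > thr else 0

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)

def compare_models(seed=0, n_train=200, n_new=2000):
    rng = random.Random(seed)
    train_x = [rng.uniform(-1, 1) for _ in range(n_train)]
    train_y = [true_label(x) for x in train_x]
    new_x = [rng.uniform(-1, 1) for _ in range(n_new)]   # never seen in training
    new_y = [true_label(x) for x in new_x]
    memorizer = fit_memorizer(train_x, train_y)
    threshold = fit_threshold(train_x, train_y)
    return {
        "memorizer_train": accuracy(memorizer, train_x, train_y),
        "memorizer_new": accuracy(memorizer, new_x, new_y),
        "threshold_train": accuracy(threshold, train_x, train_y),
        "threshold_new": accuracy(threshold, new_x, new_y),
    }

for name, acc in compare_models().items():
    print(f"{name}: {acc:.3f}")
```

Both models are perfect on the training points. On new points the lookup table can only fall back to its default, so it should hover around chance, while the threshold rule should stay close to perfect.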

Thank you for clarifying!

>> It's widely recognized that image recognition models typically perform well also on such data. We don't need to quantify that exactly to conclude that many large (in terms of parameters) models generalize quite well to data neither in the training or the test set.

I disagree. That is exactly what we need to quantify with great care, precisely because if it were true, an explanation would be needed.

As I say above, and as far as I'm aware, nobody bothers to do this quantification and so any "widely recognised" idea that models generalise to unseen data is only hearsay.