by sendtown_expwy 1727 days ago
You are incorrect about the input dimensionality mattering. Let's say you have 100 high-res images with yes/no labels. If you hash the images and put their labels in a hashmap, you can say this is a "learned" function of 100 parameters which achieves zero training error on the dataset. This parameter count is independent of input dimension. Why do you think this would change when this mapping is replaced by a smooth neural network mapping?
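To make the hash map example concrete, here is a minimal sketch. The names `image_key`, `fit_lookup`, `predict`, and `make_dataset` are made up for illustration, and the "images" are just random bytes standing in for real pixels. It only shows that the number of stored labels stays at 100 whether each image has 64 pixels or 4096.

```python
import hashlib
import random


def image_key(pixels: bytes) -> str:
    """Hash raw pixel bytes to a fixed-size key."""
    return hashlib.sha256(pixels).hexdigest()


def fit_lookup(images, labels):
    """'Train' by storing one yes/no label per image hash."""
    return {image_key(img): label for img, label in zip(images, labels)}


def predict(table, pixels, default=False):
    """Return the stored label; unseen images fall back to `default`."""
    return table.get(image_key(pixels), default)


def make_dataset(n_images, n_pixels, seed=0):
    """Hypothetical toy data: random 8-bit 'images' with random yes/no labels."""
    rng = random.Random(seed)
    images = [
        bytes(rng.getrandbits(8) for _ in range(n_pixels))
        for _ in range(n_images)
    ]
    labels = [rng.random() < 0.5 for _ in range(n_images)]
    return images, labels


# Same number of images, very different input dimensionality.
for n_pixels in (64, 4096):
    images, labels = make_dataset(100, n_pixels)
    table = fit_lookup(images, labels)
    errors = sum(predict(table, img) != lab for img, lab in zip(images, labels))
    print(n_pixels, len(table), errors)
```

For both image sizes the table holds one label per training image and makes no training errors, which is the sense in which the "parameter count" here does not depend on input dimension.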

GPT is trained to predict the input (estimating p(x)), versus predicting a label given an input (p(y|x)). So in the case of GPT you can use the input dimensionality as a "label", as another responder has mentioned. ImageNet classification is different (excepting recent semi-supervised or unsupervised approaches to image recognition).

The ability to generalize in the typical ImageNet setting is, as the article says, a byproduct of SGD with early stopping, which in practice limits the set of functions a deep neural network can express (something not captured by an analysis that only considers parameter count).

1 comments

The point is that your simple mapping, with zero error on the training dataset, also has zero predictive power on both the test dataset and real-world data. It has learned nothing; it is at the extreme end of overfitting.
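A rough illustration of that failure mode, assuming a SHA-256 keyed lookup table as the "model" (the helper `image_key` and the 4-pixel image are hypothetical): changing one pixel by one produces a completely different key, so the table has nothing to say about the new input.

```python
import hashlib


def image_key(pixels: bytes) -> str:
    """Hash raw pixel bytes to a fixed-size key."""
    return hashlib.sha256(pixels).hexdigest()


# A "trained" lookup table holding one memorized image (hypothetical 4-pixel example).
train_image = bytes([10, 20, 30, 40])
table = {image_key(train_image): True}

# Change a single pixel value by one.
perturbed = bytes([10, 20, 30, 41])

seen = table.get(image_key(train_image))   # the memorized label
unseen = table.get(image_key(perturbed))   # no entry: the table has no guess at all
print(seen, unseen)
```

The memorized image returns its label, while the nearly identical one returns `None`.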

Input dimensionality is absolutely important when determining net size.

This seems like people talking past each other to me. They were responding to the erroneous claim that "input dimensionality" is equivalent to data. What the first poster referred to as "internal data points" may be better described as the presumption of differentiability, that is, a small disturbance of the pixels should result in a "small" change of the labels. But it was ridiculous to claim that the total number of pixels is somehow a meaningful measure of sample size. The pixels are not independent, as dramatized by the hash map example given above.

That's the point. 100 parameters is sufficient to overfit, and it's a number that's independent of the input size. Do you have a reference for your statement?

Reference for what exactly? That input dimensionality is important when determining net size? That seems quite self-explanatory; try training an image classifier with only 100 parameters.

Maybe I misunderstood the question, but regardless, even without early stopping, an NN would have more predictive power than the hash mapping. Both would be completely overfit to the training set, yet the NN would most likely still be able to make some reasonable guesses on out-of-distribution (OOD) data.