One thing to note is that here noise != Gaussian iid noise, so these are not typical white noise images. I think we were not really clear on that part, but for us noise is basically a random process, which takes a seed as input (plus potentially some very low-level assumptions over image statistics, such as a 1/f spectrum) and produces a synthetic image.
It is then possible to generate arbitrary amounts of these images as samples from the stochastic process - these images exhibit certain image-like structures (such as oriented edges), but are as a whole still random and extremely varied, which is good and necessary for the representation learning.
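To make "a random process that takes a seed plus a 1/f spectrum assumption" concrete, here is a minimal sketch in Python with NumPy. It is only an illustration of the simplest case described above, not the generators used in the paper; the function name `spectrum_noise_image` and the parameters `size` and `alpha` are made up for this example.

```python
import numpy as np

def spectrum_noise_image(seed, size=64, alpha=1.0):
    """Sample a grayscale image whose amplitude spectrum falls off as 1/f**alpha.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    # Start from white Gaussian noise; the seed fully determines the sample.
    white = rng.standard_normal((size, size))
    # Radial frequency of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    radius[0, 0] = 1.0  # placeholder to avoid dividing by zero at the DC bin
    # Shape the spectrum: attenuate each frequency by 1/f**alpha.
    spectrum = np.fft.fft2(white) / radius**alpha
    spectrum[0, 0] = 0.0  # drop the DC component
    # The filter is real and symmetric, so the result is real up to rounding.
    image = np.fft.ifft2(spectrum).real
    # Rescale to [0, 1] so it can be treated as pixel intensities.
    image -= image.min()
    image /= image.max()
    return image

# Arbitrarily many distinct training images: one per seed.
samples = [spectrum_noise_image(seed) for seed in range(4)]
```

Each seed gives a different image, and the same seed always reproduces the same one, so the "dataset" is effectively unlimited.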
In terms of helping, though, it is important to note that we do not achieve state-of-the-art performance yet, and when looking at absolute performance for a task like image classification, using real images is still better. That being said, something that is in the paper but generally seems to get lost is that our representations work very well when analyzing data that is very different from normal images, such as medical images or satellite images.
But then you aren't really throwing "random noise" at it, are you? It's more like you are throwing generated data sets with abstract structures at it, and using the randomization to ensure that it does not overfit on other accidental structures that might be in an individual image, because the randomization ensures that there are no other structures to speak of in the "average" (which does sound like a very sensible way to train a network on abstract structures). Or do I misunderstand the method here?
Oh for sure, and I don't mean to accuse the authors; if pop-sci articles spread confusion about their work, that's not their fault. I just want to clear things up for myself.
It's not random noise. Look at the images in the paper. Horizontal lines, vertical lines, snakeskin patterns, Minecraft textures. Examples of miscellaneous surface patterns, in other words.
Back before deep learning, people used to make recognizers for features like that as a lower level of feature recognition. Now it's expected that features will be derived automatically from real imagery. This is kind of a return to that level.
A useful training set might be a big texture library used for game development or animation. Those are easily available.
That would indeed be an interesting thing to try: use real data, but only in terms of textures, so effects like occlusions, perspective, etc. would not be present.
I would expect it to be somewhere in the ballpark of our StyleGAN images, which also look very "textural" but lack the effects that are a result of imaging the 3D world. Interestingly, modelling these effects without realistic textures seems to result in worse performance. This is the case, for example, for images taken from CLEVR or generated from Minecraft; both perform worse than the StyleGAN images.