Hacker News new | ask | show | jobs
by sdenton4 1731 days ago
My big beef with a lot of the 'leading edge' ML research is that it tends to be waaaaay too focused on classification problems, and ImageNet in particular. And, last I checked, you /do/ still fight with overfitting in classification models, by cleverly choosing learning rate schedules and using early stopping schemes, 'double descent' be damned.

You can solve classification with a hash function: Hash the image, and then just memorize which label goes with which hash. You can try to dodge this obviously dodgy solution by adding augmentation to the dataset. Then you instead learn to find a representation invariant under the set of augmentations, and learn the hash of that representation. It turns out these augmentation-invariant representations are actually pretty good, so we can solve the classification problem in what looks like a general way.

However, there are many other classes of problems where the hash problem doesn't exist, because the information density of the outputs is too high to memorize in the same way. Specifically, generative models, and the sorts of predictive/infill problems used for self-supervision. In these spaces, the problems are more like: "Given this pile of augmented input, generate half a megabyte of coherent output." These kinds of problems simply don't overfit: Train a speech separation model on a big dataset, and the train+eval quality metrics will just asymptote their way up and to the right until you run out of training budget.

2 comments

Memorization is only an issue if you allow it to be. If design the model with a "narrow" enough inner stage then that limits the level of detail (in terms of distinct representable values) passed to subsequent stages. This should give you an ML algorithm that consists of a fingerprint (approximates your hashing) stage followed by a classifier that works based on the fingerprint input (approximates a table lookup). Such an algorithm should not have such a problem with over-fitting was you describe.
Memorization is only an issue if you allow it to be.

Sure, it's a potential problem that can appear in the process implementing a deep learning solution. It's not an insurmountable problem. But the fact that still appears seems like an indication the situation in deep learning is more complicated than "overparameterization is not a problem".

When you say hash function, do you mean a cryptographic hash function? How on earth could the performance of that be anywhere near the simplest probabilistic algorithm on unseen examples?
No, nothing cryptographic here. All I'm saying is that you can memorize the dataset by extracting a small fingerprint of each training example and associating it with an output label: ie, learn by lookup table. Then you don't need to memorize the whole training set, you just need to find/learn the fingerprinting function. With no augmentation, you might as well use MD5... With augmentation, you do need to do some actual work to learn to extract an augmentation invariant projection of the training examples, but the basic principle is the same.
I have nothing to do with machine learning but it seems like the hashing approach would only work if you are “training” on the evaluation set instead of a separate training set. Afaik in image net like challenges the set of labeled training images does not contain any of the evaluation images so there wouldn’t be any hashes matching any of the evaluation data.
Yes, you're right. You should never see the test/evaluation dataset during training so it would be impossible to "memorize" the test cases. You would get good near perfect accuracy on the training data, but not the test set. I think the closest analogue would be models that produce conceptual embeddings somewhere in them -- those are kind of like hashes with the property that similar things have similar embeddings. Many classification neural networks kind of operate like that -- the initial layers produce a representation of the data and then the final layer actually performs the classification.
Err.. hash functions like MD5 and SHA256 are "cryptographic". That just means one with a random distribution of outputs as opposed to maybe Apple's "neural" hash function which has outputs that do the "augmentation invariant projection" you speak of.

What I'm trying to say is that neural networks are "universal approximators of continuous real functions". You can think of them as finding the curve of a function which matches the data to an expected and they get their predictive power by matching the underlying "function" of the problem.

Applying a cryptographic hash function is like completely scrambling the underlying function. The only way for a neural network to match it is if it was somehow a universal approximator of a discontinuous real function. You can either do that by getting into unexplored chaos theory or making a gigantic lookup table for every single possible bit combination. The former no human being knows how to do, and the latter is impossible for even a 64 bit combination (nevermind an entire image, audio clip, or video).

>> making a gigantic lookup table for every single possible bit combination

You don't need this to achieve zero loss on the training set, though: You only need a lookup table for the images in the train set.

We know that neural networks can do something like this (learning the lookup table) because large networks can get to zero training loss on randomly assigned labels. (I linked the paper a bit further down in the thread.) This means there's some memorization capability in the architecture, even if it's a weird emulation of some memorization strategy that we would consider easy.

The actual mechanism here is probably closer to random projection + nearest neighbor; NNs are not obviously learning crypto functions. But they /are/ learning some kind of lookup mechanism. There's some indication (see Sara Hooker's work) that in practice they use a mixture of 'reasonable' strategies and memorization for long-tail training examples. We don't know /how much/ the leading networks trained on real labels rely on memorization because we don't have any real insight into the learned structures.

(as an aside, we train neural networks for discontinuous functions all the time: Classification is discontinuous, by the nature of the labels. We turn it into a continuous+trainable problem by choosing a probabilistic framing.)

Okay but that would only work for examples with which you already have. All interesting cases of neural networks are applying it to unseen inputs. How does your technique work with unseen inputs?

And while we interpret the result of a classification as a 1 or 0, the underlying result is a continuous probability. Even in reality, our training examples are labeled with too much confidence - some labels are vague even for humans. If it approximates a discontinuous function, then it does so by approximating a continuous function. You can read here for more information: https://www.sciencedirect.com/science/article/abs/pii/089360...

Yes, this is the point: When we train a neural network, especially on a classification problem, it has multiple avenues to solve the problem. We know they are capable of ineffectual memorization, as well as some other less ridiculous things. When we train, it's not clear what mix we're getting of 'neural hashing' vs learning abstracted features.

My point up above is that classification problems are too weak, exactly because these kinds of shortcuts are readily available. The leading edge of ML research is over-focused on ImageNet classification in particular.

Has this been implemented? What kinds of hashing functions are you talking about? How would you guarantee the same hash for all the augmentations?

It seems like the approach you describe just moves the complexity of the task solved by neural networks into the hashing function.

https://openreview.net/forum?id=Sy8gdB9xx

"our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."

What's your point? It's not clear at all from this post.
Fuzzy hashing is a thing and may actually use cryptographic hashing internally.

See for example recent work on hashing for malware detection.