by nomailing 3681 days ago
Could you please elaborate on this? I would really like to know whether autoencoders are still useful for classification when only a small part of my training data is labeled. Is unsupervised pretraining still useful, or has it been completely replaced by other techniques, as the article seems to suggest?

A single-layer autoencoder with n nodes is equivalent to doing PCA and taking the first n principal components. If you're familiar with PCA in natural language processing, where it is called Latent Semantic Analysis (or Indexing), you know that projecting high-dimensional data onto a lower-dimensional surface can actually improve your features. This is because similar words project onto the same principal component, which lets you model some semantic information.
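To make the LSA point concrete, here is a minimal NumPy sketch. The vocabulary and the tiny document-term matrix are made up for illustration, and it uses a plain truncated SVD without mean-centering, which is how LSA is usually done. "car" and "automobile" never appear in the same document, so their raw count vectors are orthogonal, but both co-occur with "engine" and end up on the same latent component.

```python
import numpy as np

# Hypothetical toy corpus: rows are documents, columns are term counts.
terms = ["car", "automobile", "engine", "banana", "fruit"]
X = np.array([
    [1, 0, 1, 0, 0],   # "car engine"
    [0, 1, 1, 0, 0],   # "automobile engine"
    [0, 0, 0, 1, 1],   # "banana fruit"
    [0, 0, 0, 1, 1],   # "banana fruit"
], dtype=float)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Truncated SVD: keep the k strongest latent components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = Vt[:k].T * s[:k]          # one k-dimensional vector per term

car, auto, banana = (terms.index(t) for t in ("car", "automobile", "banana"))

raw_sim = cosine(X[:, car], X[:, auto])                   # raw count space
latent_sim = cosine(term_vecs[car], term_vecs[auto])      # latent space
latent_cross = cosine(term_vecs[car], term_vecs[banana])  # unrelated terms

print(f"raw: {raw_sim:.3f}  latent: {latent_sim:.3f}  cross: {latent_cross:.3f}")
```

In the raw space the two synonyms have similarity 0. In the 2-dimensional latent space they become nearly identical, while "car" and "banana" stay unrelated.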

Autoencoders with more than one layer are more interesting because you end up doing what is essentially non-linear PCA, projecting your data onto a curved manifold. The famous paper "Reducing the Dimensionality of Data with Neural Networks" [0], by Hinton and Salakhutdinov, shows how much more linearly separable documents become once multi-layer autoencoders are used.

The old argument was that unsupervised pretraining helps get proper weights faster, but this has largely been disproven. However, I do believe AEs assist in semi-supervised learning because they project the initial data into a more useful space. As you can see in the article I linked, the projected data are much more linearly separable.

As practical evidence: I used a 5-layer AE in the Kaggle black box competition [1] and eventually outranked a team of Hinton's grad students. The problem had a large unlabeled data set with only a small number of labels. Using the autoencoders before the MLP ended up nearly doubling our team's score.
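The general workflow can be sketched as follows. This is a rough illustration, not the competition code. The data are made-up Gaussian blobs, the autoencoder has a single tanh hidden layer rather than five layers, and a nearest-centroid classifier stands in for the MLP. The autoencoder is trained on all points without labels, and only ten labeled points are used afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs in 6-D.
n, d, k = 400, 6, 2
y = rng.integers(0, 2, size=n)
X = np.where(y[:, None] == 1, 1.0, -1.0) + rng.normal(size=(n, d))

# Pretend only 5 points per class come with a label.
labeled = np.concatenate([np.flatnonzero(y == c)[:5] for c in (0, 1)])

# Autoencoder parameters: tanh encoder, linear decoder.
W1 = 0.1 * rng.normal(size=(d, k))
b1 = np.zeros(k)
W2 = 0.1 * rng.normal(size=(k, d))
b2 = np.zeros(d)


def encode(A):
    """Map inputs to the k-dimensional code."""
    return np.tanh(A @ W1 + b1)


def reconstruction_loss(A):
    """Mean squared reconstruction error per sample."""
    return float(np.mean(np.sum((encode(A) @ W2 + b2 - A) ** 2, axis=1)))


# Step 1: unsupervised training on ALL data (labels are not used here).
initial_loss = reconstruction_loss(X)
lr = 0.005
for _ in range(1000):
    H = encode(X)
    g_out = 2.0 * (H @ W2 + b2 - X) / n        # dLoss/dOutput
    g_pre = (g_out @ W2.T) * (1.0 - H ** 2)    # backprop through tanh
    gW2, gb2 = H.T @ g_out, g_out.sum(axis=0)
    gW1, gb1 = X.T @ g_pre, g_pre.sum(axis=0)
    W1 -= lr * gW1
    b1 -= lr * gb1
    W2 -= lr * gW2
    b2 -= lr * gb2
final_loss = reconstruction_loss(X)

# Step 2: supervised step on the codes, using only the labeled points.
codes = encode(X)
centroids = np.stack(
    [codes[labeled][y[labeled] == c].mean(axis=0) for c in (0, 1)]
)
dists = np.linalg.norm(codes[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = float((pred == y).mean())

print(f"loss {initial_loss:.2f} -> {final_loss:.2f}, accuracy {accuracy:.2f}")
```

The point of the design is that the encoder sees every sample, labeled or not, so the small labeled set only has to separate classes in a space that already reflects the structure of the data.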

[0] https://www.cs.toronto.edu/~hinton/science.pdf
[1] https://www.kaggle.com/c/challenges-in-representation-learni...

Thank you for the answer. That makes a lot of sense.

Just a side note: as far as I know, a single-layer autoencoder and PCA are only equivalent if the units have a linear activation function (that is, no nonlinearity), which is usually not the case.
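The linear case can be checked numerically. Below is a small NumPy sketch on made-up data, with a learning rate and step count chosen by hand for this toy problem. It trains an autoencoder with no activation function by plain gradient descent and compares it with PCA computed from an SVD. The names `E` and `D` (encoder and decoder weights) are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 points in 5-D that mostly vary along 2 directions.
n, d, k = 500, 5, 2
basis, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random orthonormal basis
Z = rng.normal(size=(n, k)) * np.array([2.0, 1.0])   # latent coordinates
X = Z @ basis[:, :k].T + 0.1 * rng.normal(size=(n, d))
X -= X.mean(axis=0)                                  # PCA assumes centered data

# PCA: project onto the top-k right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T
pca_mse = np.mean(np.sum((X - X @ V @ V.T) ** 2, axis=1))

# Linear autoencoder (no activation function), trained by gradient descent.
E = 0.01 * rng.normal(size=(d, k))   # encoder weights
D = 0.01 * rng.normal(size=(k, d))   # decoder weights
lr = 0.02
for _ in range(3000):
    H = X @ E                        # codes
    R = H @ D - X                    # reconstruction residual
    grad_D = (2.0 / n) * H.T @ R
    grad_E = (2.0 / n) * X.T @ (R @ D.T)
    E -= lr * grad_E
    D -= lr * grad_D
ae_mse = np.mean(np.sum((X @ E @ D - X) ** 2, axis=1))

# Compare the subspace spanned by the decoder with the PCA subspace.
Q, _ = np.linalg.qr(D.T)             # orthonormal basis of the decoder rows
subspace_gap = np.linalg.norm(Q @ Q.T - V @ V.T)

print(f"PCA mse {pca_mse:.4f}, AE mse {ae_mse:.4f}, gap {subspace_gap:.2e}")
```

Note that the equivalence is about the subspace and the reconstruction error. The learned weights themselves are generally not the principal components, only some basis for the same subspace. Putting a nonlinearity such as tanh or a sigmoid on the hidden units changes the model, and this comparison no longer has to hold.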

"The old argument was that unsupervised pretraining helps get proper weights faster, but this has largely been disproven."

Do you hold that to be true in general, or only when using dropout?