A single-layer autoencoder with n hidden nodes, linear activations, and a squared-error loss is equivalent to doing PCA and keeping the first n principal components: the hidden layer spans the same subspace, although its individual units need not line up with the individual components. If you're familiar with PCA in natural language processing, where it is called Latent Semantic Analysis (or Indexing), you know that projecting high-dimensional data onto a lower-dimensional surface can actually improve your features. This is because similar words tend to project onto the same principal components, which lets you model some semantic information.
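
To make the PCA claim concrete, here is a minimal numpy sketch. It assumes linear activations, a squared-error loss, and centered data, and it uses made-up toy data plus plain gradient descent; the variable names, learning rate, and step count are my own choices for illustration, not anything from the paper. It compares the subspace spanned by the trained decoder with the subspace of the top principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 500 samples in 5 dimensions,
# with most of the variance concentrated in 2 directions.
scales = np.array([3.0, 2.0, 0.5, 0.4, 0.3])
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
X = (rng.normal(size=(500, 5)) * scales) @ rotation
X -= X.mean(axis=0)  # both PCA and the comparison assume centered data

n_hidden = 2

# PCA via SVD: the rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:n_hidden].T          # shape (5, n_hidden)
P_pca = V @ V.T              # projector onto the PCA subspace

# Linear autoencoder: x -> x @ W_enc -> (x @ W_enc) @ W_dec, no activation.
W_enc = rng.normal(size=(5, n_hidden)) * 0.01
W_dec = rng.normal(size=(n_hidden, 5)) * 0.01
lr = 0.01
for _ in range(5000):
    H = X @ W_enc            # hidden codes
    R = H @ W_dec - X        # reconstruction residual
    grad_dec = 2.0 * H.T @ R / len(X)
    grad_enc = 2.0 * X.T @ (R @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# Projector onto the subspace spanned by the decoder rows.
Q, _ = np.linalg.qr(W_dec.T)
P_ae = Q @ Q.T

err_pca = np.mean((X @ P_pca - X) ** 2)
err_ae = np.mean((X @ W_enc @ W_dec - X) ** 2)
print("max projector difference:", np.abs(P_ae - P_pca).max())
print("reconstruction MSE, PCA:", err_pca, "autoencoder:", err_ae)
```

Note that the comparison is between projectors, not between weight matrices: the hidden units can be any invertible mix of the principal directions and still reconstruct equally well.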

Autoencoders with more than one layer are more interesting because you end up doing what is essentially non-linear PCA, projecting your data onto a curved manifold. The famous paper "Reducing the Dimensionality of Data with Neural Networks" [0] by Hinton and Salakhutdinov shows how much more linearly separable documents become once multi-layer autoencoders are used.
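
As a rough illustration of the "curved manifold" idea, the sketch below trains a small tanh autoencoder with a one-unit bottleneck on 2-D points that lie near a parabola, and prints its reconstruction loss next to that of a single principal component. Everything here is an assumption made for the example: the toy data, the 2-8-1-8-2 layer sizes, the hand-written backprop, and the learning rate. It is not the architecture or training procedure from the paper (which used RBM pretraining), and how the two losses compare depends on the seed and the training budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 2-D points near a parabola,
# i.e. a 1-D curved manifold embedded in 2-D.
t = rng.uniform(-1.0, 1.0, size=(400, 1))
X = np.hstack([t, t ** 2]) + 0.01 * rng.normal(size=(400, 2))
X -= X.mean(axis=0)

# Layer sizes 2 -> 8 -> 1 -> 8 -> 2; the single middle unit is the code.
sizes = [2, 8, 1, 8, 2]
Ws = [rng.normal(size=(a, b)) / np.sqrt(a) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
TANH_LAYERS = (0, 2)  # the code layer and the output layer stay linear


def forward(X):
    """Return the activations of every layer, input included."""
    acts = [X]
    for i, (W, b) in enumerate(zip(Ws, bs)):
        z = acts[-1] @ W + b
        acts.append(np.tanh(z) if i in TANH_LAYERS else z)
    return acts


def train_step(X, lr):
    """One full-batch gradient step; returns the loss before the update."""
    acts = forward(X)
    loss = np.sum((acts[-1] - X) ** 2) / len(X)
    delta = 2.0 * (acts[-1] - X) / len(X)      # dLoss / d(output)
    for i in reversed(range(len(Ws))):
        if i in TANH_LAYERS:
            delta = delta * (1.0 - acts[i + 1] ** 2)   # tanh derivative
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        delta = delta @ Ws[i].T                # propagate before updating W
        Ws[i] -= lr * grad_W
        bs[i] -= lr * grad_b
    return loss


initial_loss = train_step(X, lr=0.0)   # lr=0 only measures the loss
for _ in range(5000):
    train_step(X, lr=0.02)
final_loss = train_step(X, lr=0.0)
codes = forward(X)[2]                  # 1-D code for every sample

# Baseline: the best 1-D linear projection (one principal component).
_, s, _ = np.linalg.svd(X, full_matrices=False)
pca_loss = np.sum(s[1:] ** 2) / len(X)

print("autoencoder loss:", final_loss, "1-component PCA loss:", pca_loss)
```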

The old argument was that unsupervised pretraining helps you reach good weights faster, but this has largely been disproven. However, I do believe AEs assist in semi-supervised learning because they project the initial data into a more useful space. As you can see in the paper I linked [0], the projected data are much more linearly separable.

As practical evidence: I used a 5-layer AE in the Kaggle black box competition [1] to eventually outrank a team of Hinton's grad students. The problem had a large unlabeled data set and only a small number of labels. Using the autoencoders before the MLP ended up nearly doubling our team's score.
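
For anyone who wants to see the shape of that kind of pipeline, here is a hedged sketch: train an autoencoder on the unlabeled pool, then fit a classifier on the encoded version of the small labeled set. This is not the competition code. The data is synthetic, the `make_data` helper is hypothetical, the autoencoder has a single tanh hidden layer rather than five layers, and plain logistic regression stands in for the MLP.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical toy data: two Gaussian classes in 10 dimensions."""
    y = rng.integers(0, 2, size=n)
    centers = np.where(y[:, None] == 1, 1.0, -1.0) * np.linspace(0.2, 1.0, 10)
    return centers + rng.normal(size=(n, 10)), y


X_unlabeled, _ = make_data(2000)    # labels discarded on purpose
X_train, y_train = make_data(40)    # the small labeled set
X_test, y_test = make_data(500)

# Step 1: autoencoder (10 -> 4 -> 10) trained on the unlabeled pool only.
W1 = rng.normal(size=(10, 4)) * 0.1
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 10)) * 0.1
b2 = np.zeros(10)
lr = 0.01
for _ in range(3000):
    H = np.tanh(X_unlabeled @ W1 + b1)
    R = 2.0 * (H @ W2 + b2 - X_unlabeled) / len(X_unlabeled)  # dLoss/d(output)
    dH = (R @ W2.T) * (1.0 - H ** 2)                          # back through tanh
    W2 -= lr * (H.T @ R)
    b2 -= lr * R.sum(axis=0)
    W1 -= lr * (X_unlabeled.T @ dH)
    b1 -= lr * dH.sum(axis=0)


def encode(X):
    """Map raw inputs to the learned 4-D code."""
    return np.tanh(X @ W1 + b1)


# Step 2: logistic regression on the codes of the labeled examples.
Z_train = encode(X_train)
w = np.zeros(4)
c = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Z_train @ w + c)))
    g = p - y_train
    w -= 0.5 * (Z_train.T @ g) / len(g)
    c -= 0.5 * g.mean()

# Step 3: evaluate on held-out labeled data.
accuracy = np.mean(((encode(X_test) @ w + c) > 0) == (y_test == 1))
print("test accuracy on encoded features:", accuracy)
```

The point of the structure is that step 1 never touches a label, so it can use all of the unlabeled data, while step 2 only has to fit a small model on a handful of labeled points.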

[0] https://www.cs.toronto.edu/~hinton/science.pdf

[1] https://www.kaggle.com/c/challenges-in-representation-learni...