	by murbard2 3595 days ago
	What puzzles me with the variational autoencoder is that there is no reason to expect the covariance of p(z|x) to be diagonal. This sounds like such a crude approximation that there ought to be little benefit in even treating it as a distribution rather than a point mass. And yet it seems to do rather well (though not as well as GANs, which do represent arbitrary distributions).

2 comments

taliesinb 3595 days ago

VAEs can be extended to make the latent variables dependent. OpenAI's inverse autoregressive flow (IAF) is one recent way that is particularly efficient: http://arxiv.org/pdf/1606.04934v1.pdf. Linear IAF is the simplest form of this; with it you can model a normal z with an arbitrary covariance matrix.
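As a rough illustration of the linear case (a sketch, not code from the paper): draw a sample from a diagonal Gaussian and multiply it by a lower-triangular matrix with ones on the diagonal. The names `sample_linear_flow`, `mu`, `sigma` and `L` are made up for this example, and the values are fixed by hand here, whereas in a VAE the encoder would output them for each x.

```python
import random


def sample_linear_flow(mu, sigma, L, rng):
    """Draw z = L @ z0, where z0 ~ N(mu, diag(sigma^2))."""
    z0 = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    d = len(z0)
    return [sum(L[i][j] * z0[j] for j in range(d)) for i in range(d)]


def covariance(samples):
    """Unbiased sample covariance matrix of a list of vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    return [
        [
            sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


rng = random.Random(0)
mu = [0.0, 0.0]                 # hand-picked, illustrative values
sigma = [1.0, 0.5]
L = [[1.0, 0.0],
     [0.8, 1.0]]                # lower triangular, ones on the diagonal

samples = [sample_linear_flow(mu, sigma, L, rng) for _ in range(20000)]
cov = covariance(samples)
print(cov)  # off-diagonal entries are clearly nonzero
```

The covariance of z here is L diag(sigma^2) L^T, so the off-diagonal term comes entirely from the 0.8 entry in `L`, even though the underlying noise is diagonal.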

But aside from that, there is an information-theoretic view on why you might prefer VAEs over AEs. In short, having p(z|x) not be a point mass (as it is in an ordinary AE) allows you to bound the information flow through the bottleneck. The KL loss on p(z|x) forces the network to be honest about how much information it is cramming into z for the purposes of reconstruction.

To unpack that a bit: in theory, even a single real-valued latent variable z could store an arbitrary amount of information (if the encoder and decoder conspired cleverly enough). But if you make z stochastic, or in other words if your encoder's job is to calculate the parameters of a distribution from which you sample z, you're essentially introducing a noisy channel in the middle of your network, and you can then bound how much information is flowing across that channel. But to do that you still need to use a KL divergence loss to encourage p(z|x) to approximate your chosen latent distribution, otherwise your encoder and decoder might cheat, e.g. by using a near-point-mass z, which turns the model back into an ordinary AE.
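A toy version of the "conspiring" encoder and decoder, as a sketch only: pack several bytes into one float, recover them exactly, then see what a little noise on z does. The functions `encode` and `decode` are hypothetical, and a float64 only holds about 53 bits, so "arbitrary" is capped by the number format here.

```python
import random


def encode(symbols):
    """Pack a list of bytes into a single float in [0, 1)."""
    as_int = 0
    for s in symbols:
        as_int = as_int * 256 + s
    return as_int / 256 ** len(symbols)


def decode(z, n):
    """Recover n bytes from z using scaling, rounding and modulo."""
    as_int = int(round(z * 256 ** n))
    out = []
    for _ in range(n):
        out.append(as_int % 256)
        as_int //= 256
    return out[::-1]


message = [72, 101, 108, 108, 111, 33]   # 6 bytes = 48 bits, fits in a float64
z = encode(message)

clean = decode(z, len(message))          # deterministic "point mass" code

rng = random.Random(0)
noisy_z = z + rng.gauss(0.0, 1e-3)       # the same code sent over a noisy channel
noisy = decode(noisy_z, len(message))

print(clean == message)
print(noisy == message)
```

Without noise all 48 bits survive. With Gaussian noise of standard deviation 1e-3 the low-order bytes are scrambled, which is the sense in which a stochastic z limits what can pass through the bottleneck.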

Or in deep learning speak, it's a form of regularization with a particularly rich and interpretable statistical motivation.
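For concreteness, here is a minimal sketch of the KL term for the usual setup, assuming a diagonal Gaussian encoder and a standard normal as the chosen latent distribution. The function name is made up for illustration.

```python
import math


def kl_diag_gaussian_to_std_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats, summed over dimensions."""
    return sum(
        0.5 * (m * m + s * s - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )


# The penalty grows without bound as the encoder approaches a point mass.
penalties = [
    kl_diag_gaussian_to_std_normal([0.0], [s])
    for s in (1.0, 0.1, 1e-3, 1e-6)
]
print(penalties)
```

The `-log(sigma)` term is what makes a near-point-mass z expensive: shrinking sigma to sneak more information through costs more and more KL.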


svantana 3595 days ago

IMHO this comment is much better than the original blog post, well done!


murbard2 3594 days ago

I get the regularization part, but don't you get essentially the same regularization from using a sparse autoencoder? If the encoder realizes it doesn't have much information, it will turn on few units.

What I don't really intuit is: is it just basically doing regularization, or is the interpretation in terms of learning to infer the posterior meaningful?


taliesinb 3594 days ago

> I get the regularization part, but don't you get essentially the same regularization from using a sparse autoencoder? If the encoder realizes it doesn't have much information, it will turn on few units.

Putting a sparsity loss on z in a regular AE will encourage the code to have smaller magnitudes, and with ReLU those units will tend to saturate to zero, yes.

But the original point was that even a single continuous unit can be used to transmit an arbitrary amount of information. It's not so much that this happens in practice, because the encoder and decoder would need access to something like modulo to do the most obvious kinds of cheating. It's that, from an information theory point of view, you can't really talk about how much information a continuous variable transmits unless you are transmitting it over a noisy channel and can measure entropies of distributions (and indeed you can formally derive how a given KL loss bounds the information transmitted by z).
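One standard way to write that bound, as a sketch: let q(z|x) be the encoder's distribution (what this thread writes as p(z|x)), p(z) the chosen latent distribution, and q(z) = E_x[q(z|x)] the aggregate distribution of codes over the data. The symbols q and I_q are notation introduced here, not from the comments above.

```latex
\mathbb{E}_{x}\!\left[ \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big) \right]
  = I_q(x; z) + \mathrm{KL}\big(q(z) \,\|\, p(z)\big)
  \;\ge\; I_q(x; z)
```

So the average KL loss is an upper bound on the mutual information between x and z, with equality when the aggregate q(z) matches p(z) exactly.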

> What I don't really intuit is: is it just basically doing regularization, or is the interpretation in terms of learning to infer the posterior meaningful?

Both, which I think is really nice. You can look at it either way.

The Bayesian interpretation is powerful because you now have a principled way to calculate p(x), which you didn't have before. And you can introduce multiple latent variables in your network (as long as no layers take inputs from both ordinary and sampling layers) and so you have some flexibility to do limited forms of graphical modelling that supports efficient forward inference and GPU acceleration. And the inference machinery can be trained via cheap backpropagation instead of expensive sampling.
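In the basic VAE, "calculate p(x)" in practice means a lower bound or a sampling estimate rather than an exact value. A sketch in the same notation as before, with q(z|x) for the encoder and p(x|z) for the decoder (symbols assumed here, not taken from the comment):

```latex
\log p(x)
  = \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) \right]
    - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)
    + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big)
  \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) \right]
    - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)

p(x) \approx \frac{1}{K} \sum_{k=1}^{K}
  \frac{p(x \mid z_k)\, p(z_k)}{q(z_k \mid x)},
  \qquad z_k \sim q(z \mid x)
```

The first line is the usual evidence lower bound, which is tight when the encoder matches the true posterior. The second is an importance-sampling estimate that uses the encoder as the proposal.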


phreeza 3595 days ago

Isn't that a desirable feature though? It means your latent features are uncorrelated, which arguably makes them more interpretable. For example, you could get gender and hair color instead of (0.5 gender + 0.5 hair color) and (0.5 gender - 0.5 hair color).


murbard2 3594 days ago

They shouldn't be uncorrelated given x. https://en.wikipedia.org/wiki/Berkson%27s_paradox
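A small simulation of the effect, as a sketch under made-up assumptions: two latent factors that are independent a priori, and an observation x that is their sum plus a bit of noise. Conditioning on x being near a fixed value makes the factors strongly correlated.

```python
import math
import random


def correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)

# Prior: z1 and z2 are independent standard normals.
latents = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(100000)]

# Toy observation model (an assumption for this example): x = z1 + z2 + noise.
observed = [(z1, z2, z1 + z2 + rng.gauss(0.0, 0.1)) for z1, z2 in latents]

# Approximate "given x" by keeping only samples whose x is close to 1.0.
given_x = [(z1, z2) for z1, z2, x in observed if abs(x - 1.0) < 0.1]

prior_corr = correlation(latents)
posterior_corr = correlation(given_x)
print(prior_corr, posterior_corr)
```

The prior correlation is close to zero, while the correlation among the retained samples is strongly negative: once x is fixed, a larger z1 has to be offset by a smaller z2. A diagonal p(z|x) cannot represent that.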
