by murbard2
3595 days ago
What puzzles me with the variational autoencoder is that there is no reason to expect the covariance of p(z|x) to be diagonal. This sounds like such a crude approximation that there ought to be little benefit in even treating it as a distribution rather than a point mass. And yet it seems to do rather well (though not as well as GANs, which do represent arbitrary distributions).
But aside from that, there is an information-theoretic view on why you might prefer VAEs over AEs. In short, having p(z|x) not be a point mass (a point mass being the ordinary AE case) allows you to bound the information flow through the bottleneck. The KL loss on p(z|x) forces the network to be honest about how much information it is cramming into z for the purposes of reconstruction.
To unpack that a bit: in theory, even a single real-valued latent variable z could store an arbitrary amount of information (if the encoder and decoder conspired cleverly enough). But if you make z stochastic, or in other words if your encoder's job is to calculate the parameters of a distribution from which you sample z, you're essentially introducing a noisy channel in the middle of your network, and you can then bound how much information is flowing across that channel. But to do that you still need a KL divergence loss to keep p(z|x) close to your chosen prior over z, otherwise your encoder and decoder might cheat, e.g. by making p(z|x) nearly a point mass and so turning back into an ordinary AE.
Or in deep learning speak, it's a form of regularization with a particularly rich and interpretable statistical motivation.
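To make the "honesty" point concrete, here is a minimal sketch of the KL term, assuming the usual setup of a diagonal-Gaussian p(z|x) and a standard normal prior (the comment above doesn't fix either choice, and the function name is made up for illustration). Averaged over the data, this KL is an upper bound on the mutual information between x and z, so it acts as a price per nat sent through the bottleneck. Shrinking sigma toward a point mass makes the price grow without bound.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats, summed over latent dims.

    Assumes a diagonal-Gaussian encoder output and a standard normal prior;
    both are illustrative choices, not something fixed by the discussion.
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

# Encoder output identical to the prior: z carries nothing about x.
print(kl_diag_gaussian_to_std_normal([0.0], [1.0]))  # 0.0 nats

# Same mean, progressively narrower p(z|x): the penalty keeps growing,
# roughly like -log(sigma), as the encoder approaches a point mass.
for sigma in (1.0, 0.1, 0.001):
    nats = kl_diag_gaussian_to_std_normal([1.0], [sigma])
    print(f"sigma={sigma}: {nats:.3f} nats = {nats / math.log(2):.3f} bits")
```

So an encoder that tries to smuggle a lot of information through a single real-valued z by making it nearly deterministic pays for it directly in the loss.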