by cfusting
2372 days ago
The author misunderstands how simulated data is created by GANs, VAEs, and other non-physics-based simulations. Let's say you have a dataset and would like to create synthetic data from it using a GAN. The GAN estimates the distribution D of the data: it learns the joint distribution P(X1, X2, ..., Xn) (where in the image case each X is usually a pixel) such that one may sample from D and obtain a new, synthetic image. Indeed, one will generate novel data, but the estimated distribution D is at best merely a description of the original data, and in practice a little bit (or a lot) off. Now turn to the machine learning problem we sought to solve with the new synthetic data: what is P(y|X1, X2, ..., Xn), where y is usually a class like "bird"? In other words, given an image, predict its label. Since the data was generated knowing only the statistics of the original data, it can add no value beyond plausible examples developed from the original data itself. Will this improve the accuracy of a model by providing additional edge-case examples and filling in gaps? Somewhat. Will it capture data not represented in the original dataset and substitute for more thorough, diverse datasets? Absolutely not. In terms of model improvement, yes, synthetic data can help. In terms of the arms race? No. True examples provide knowledge that is unique. If one uses a physics engine (GTA is popular for self-driving cars), one can gather truly novel data; this is not the case for GANs. It's concerning how willing people are to write articles on this subject without understanding the mathematics underlying the technology. Do your homework and RTFM.
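To make the point about D concrete, here is a minimal toy sketch, not a real GAN: a one-dimensional Gaussian fit stands in for the generative model, and the names `fit_gaussian` and `synthesize`, the sample sizes, and the "true" distribution are all assumptions made up for illustration. However many synthetic points are drawn, their statistics settle on the fitted estimate from the small real sample rather than on the true distribution.

```python
import random
import statistics

def fit_gaussian(sample):
    """Estimate a 1-D Gaussian from data (toy stand-in for training a GAN/VAE)."""
    return statistics.mean(sample), statistics.stdev(sample)

def synthesize(mu, sigma, n, rng):
    """Draw n synthetic points from the fitted distribution D."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)

# Hypothetical "true" data-generating process, unknown to the model.
TRUE_MEAN, TRUE_STD = 0.0, 1.0

# A small real dataset is all the generative model ever sees.
real = [rng.gauss(TRUE_MEAN, TRUE_STD) for _ in range(20)]
mu_hat, sigma_hat = fit_gaussian(real)

# Generate far more synthetic data than there was real data.
synthetic = synthesize(mu_hat, sigma_hat, 100_000, rng)
synthetic_mean = statistics.mean(synthetic)
synthetic_std = statistics.stdev(synthetic)

# The synthetic statistics track the fitted estimate, including its error.
gap_to_fit = abs(synthetic_mean - mu_hat)
gap_to_truth = abs(synthetic_mean - TRUE_MEAN)

print(f"fitted mean={mu_hat:.3f}, synthetic mean={synthetic_mean:.3f}")
print(f"gap to fitted estimate={gap_to_fit:.4f}, gap to truth={gap_to_truth:.4f}")
```

Whatever error the 20-point estimate carries is inherited by the 100,000 synthetic points; more sampling only reproduces D more faithfully, it does not move D closer to the truth.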
The power of synthesis is not within the GAN or VAE; it is in the outside mechanism that guides the creation of content using specific domain knowledge about the feature space.
This might not replace the value of real data, but it can accelerate bootstrapping, improve coverage (at the cost of accuracy), or provide free environments for auxiliary processes like CI/CD in many deep learning applications.
There is a lot of published material on synthetic data augmentation if you actually look for it.