Hacker News
by cfusting 2372 days ago
The author misunderstands how simulated data is created by GANs, VAEs, and other non-physics-based simulations. Let's say you have a dataset and would like to create synthetic data from it using a GAN. You then want to estimate the distribution D of the data. To do so, the GAN learns the joint distribution P(X1, X2, ..., Xn) (where, in the image case, each X is usually a pixel) so that one may sample from D and obtain a new, synthetic image. One will indeed generate novel data, but the estimated distribution D is at best merely a description of the original data and in practice a little bit (or a lot) off.

Now turn to the machine learning problem we sought to solve with the new synthetic data: what is P(y | X1, X2, ..., Xn), where y is usually a class like "bird"? In other words, given an image, predict its label. Since the data was generated knowing only the statistics of the original data, it can add no value beyond plausible examples developed from the original data itself.
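A toy sketch of that argument, with a fitted Gaussian standing in for the GAN (an assumption made purely for illustration; the names `fit_gaussian` and `sample_synthetic` are hypothetical). However many synthetic points are drawn, their statistics converge to the estimate of D, not to the true distribution behind the original data:

```python
import random
import statistics

def fit_gaussian(samples):
    """Stand-in for training a generative model: estimate mean and stdev."""
    return statistics.fmean(samples), statistics.stdev(samples)

def sample_synthetic(mu, sigma, n, rng):
    """Draw n synthetic points from the estimated distribution D."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
TRUE_MU, TRUE_SIGMA = 0.0, 1.0  # assumed "real world" parameters

# A small real dataset drawn from the true distribution.
real = [rng.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(20)]
mu_hat, sigma_hat = fit_gaussian(real)

# A much larger synthetic dataset drawn from the estimate.
synthetic = sample_synthetic(mu_hat, sigma_hat, 100_000, rng)
mu_syn = statistics.fmean(synthetic)

print(f"true mean:      {TRUE_MU:.3f}")
print(f"estimated mean: {mu_hat:.3f}")
print(f"synthetic mean: {mu_syn:.3f}")
```

The synthetic mean tracks the estimated mean, so any error in the estimate is carried into the synthetic data rather than corrected by it.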

Will this improve the accuracy of a model by providing additional edge case examples and filling in gaps? Somewhat. Will it understand data not represented by the original data and substitute for more thorough, diverse datasets? Absolutely not.

In terms of model improvement, yes, synthetic data can help. In terms of the arms race? No. True examples provide knowledge that is unique. If one used a physics engine (GTA is popular for self-driving cars), one can gather truly novel data; this is not the case for GANs.

It's concerning how willing people are to write articles on this subject without understanding the mathematics underlying the technology.

Do your homework and RTFM.

1 comments

You are ignoring the fact that generative AI is not a closed-loop algorithm. You can synthesize expected features in a data set and feed them to the detector from outside the generative neural network, which instead serves the purpose of mapping into (a subset of) the proper input space.

The power of synthesis is not within the GAN or VAE, it is in the outside mechanism that guides the creation of content with specific domain knowledge about the feature space.

This might not replace the value of real data, but it can accelerate bootstrapping, improve coverage (at the cost of accuracy), or provide free environments for auxiliary processes like CI/CD in many deep learning applications.

There is a lot of published material on synthetic data augmentation if you actually look for it.

Nothing you said disputes the above comment; it agrees with its core premise:

"In terms of model improvement, yes synthetic data can help. In terms of the arms race? No. True examples provide knowledge that is unique. "

I was rather commenting on the first part, which implies that training a neural network on the statistical distribution that comes out of a GAN or VAE adds no value beyond that generative model's capabilities.

I do not agree with that because, as I explained, with domain knowledge it is very much possible to shape the data generated for augmented learning, beyond the plain statistical variations of a GAN and similar models, which are obviously of very limited value in training.
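A rough sketch of what "shaping the generated data with domain knowledge" could look like. Everything here is hypothetical: `generator` is a stand-in for a conditional generative model, "brightness" is an invented feature, and the ranges are made up. The point is only that the requested feature values come from outside the model:

```python
import random

def generator(brightness, rng):
    """Hypothetical conditional generator: maps a requested feature value
    to a noisy sample in the input space."""
    return brightness + rng.gauss(0.0, 0.05)

rng = random.Random(0)

# Feature values seen in the (assumed) real dataset: daytime scenes only.
observed = [rng.uniform(0.6, 1.0) for _ in range(1000)]

# Domain knowledge supplied by the outside mechanism: darker scenes exist too,
# so request the full range instead of the observed one.
domain_range = (0.0, 1.0)
requested = [rng.uniform(*domain_range) for _ in range(1000)]
synthetic = [generator(b, rng) for b in requested]

# Count synthetic samples that fall below anything in the real data.
below = sum(1 for s in synthetic if s < min(observed))
print(f"synthetic samples darker than any real sample: {below}")
```

Whether a real generator stays accurate that far from its training data is exactly the "coverage at the cost of accuracy" trade-off mentioned above; the sketch does not settle that.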