by lalaland1125 2309 days ago
This is an incredibly important point: for your synthetic data to be useful, your simulator must have already solved the problem at hand. In theory there is no need to even fool around with generating the synthetic data and going through the charade of training a model on it; simply extract the outcome model from your simulator directly, as that's implicitly what you are doing. For example, if you have a generative model that provides densities, you can simply compute P(Y | X) = P(X, Y) / P(X).
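
To make the P(Y | X) step concrete, here is a minimal sketch assuming a discrete model whose joint density is available as a table. The `joint` array and the function name `conditional_y_given_x` are made up for illustration; nothing here comes from a particular simulator.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows are 3 values of X, columns are 2 labels Y.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.05],
    [0.15, 0.20],
])

def conditional_y_given_x(joint):
    """Return P(Y | X) as a table with one row per value of X."""
    # Marginal P(X) is the sum of P(X, Y) over Y.
    p_x = joint.sum(axis=1, keepdims=True)
    # P(Y | X) = P(X, Y) / P(X), applied row by row.
    return joint / p_x

cond = conditional_y_given_x(joint)
```

Each row of `cond` is a distribution over Y for one value of X, so no sampling or training is needed when the densities are available like this.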

But this is not how generators work. They generally produce samples in the form

G: Q -> (X,Y)

where Q is some prior from which you are sampling. If they are not invertible, then you straight up cannot get P(X,Y) out of the generator. Even if it is invertible, getting P(X) requires integrating out the Y, which might be infeasible (since the model is not analytically integrable and is sufficiently fast changing that you need very, very many samples).
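
A rough sketch of what that looks like in practice, assuming a toy generator that only hands back samples. The `generator` function, its formulas, and the bin width are all invented for illustration. Without densities, about the only option is to count samples that land near the X you care about:

```python
import numpy as np

def generator(z):
    """Hypothetical sample-only generator G: Q -> (X, Y). It exposes no density."""
    x = np.tanh(z[:, 0]) + 0.1 * z[:, 1]
    y = (z[:, 0] + 0.5 * z[:, 1] > 0).astype(int)
    return x, y

def estimate_p_y_given_x(x, y, x0, half_width=0.05):
    """Estimate P(Y=1 | X near x0) by counting samples whose X falls in a small bin."""
    mask = np.abs(x - x0) < half_width
    n = int(mask.sum())
    if n == 0:
        return float("nan"), 0
    return float(y[mask].mean()), n

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 2))  # Q: standard normal prior (an assumption)
x, y = generator(z)
p, n = estimate_p_y_given_x(x, y, x0=0.3)
```

Only a fraction of the 100,000 draws fall in the bin around `x0`, and that fraction shrinks quickly as X gains dimensions, which is the "very, very many samples" problem.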

Mathematically valid but misses the business problem :)
Very true. If you've solved the labeling/extraction problem using a means other than ML, you can use that means to generate synthetic data. The situation at my company is exactly this.

Say you use regular expressions to extract sensitive data from form documents that are standardized but come in many variations. The pieces of information extracted are very common classes of data: first name, last name, dates, physical locations.

During the extraction process you can save the complement of the extraction (the "leftovers") and insert generated data at the extraction points. Also, because you've extracted the actual sensitive data, you can exclude that from the set of values used for generation, if it's practical.
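
A minimal sketch of that extract-and-refill idea, assuming a single made-up form line. The pattern, the placeholder tokens, and the name pools are all hypothetical; real forms would need many patterns and much larger pools.

```python
import random
import re

# Hypothetical pattern for one standardized form line.
PATTERN = re.compile(r"Name: (?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)")

# Hypothetical pools of replacement values.
POOLS = {
    "first": ["Alice", "Bob", "Carol", "Dan"],
    "last": ["Smith", "Jones", "Lee", "Patel"],
}

def extract(text):
    """Return (extracted values, leftover text with placeholders at the extraction points)."""
    m = PATTERN.search(text)
    if m is None:
        return {}, text
    values = m.groupdict()
    leftover = text
    # Replace spans from right to left so earlier offsets stay valid.
    for name in sorted(values, key=lambda g: m.start(g), reverse=True):
        leftover = leftover[:m.start(name)] + "<" + name.upper() + ">" + leftover[m.end(name):]
    return values, leftover

def synthesize(leftover, real_values, rng):
    """Fill the placeholders with generated values, excluding the real extracted ones."""
    out = leftover
    for name, pool in POOLS.items():
        candidates = [v for v in pool if v != real_values.get(name)]
        out = out.replace("<" + name.upper() + ">", rng.choice(candidates))
    return out

doc = "Form 12\nName: Alice Smith\nStatus: approved"
values, leftover = extract(doc)
synthetic = synthesize(leftover, values, random.Random(0))
```

The extracted `values` double as labels for the original document, and `synthetic` is a new labeled document that keeps the form's layout but none of the real sensitive values.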

Sometimes people get so caught up in the math and theory that they fail to see the practical solutions.