by ibarea277 2309 days ago
This just seems like something that will catastrophically fail. If you can build a good enough generator, you can just build the ML model internally. And if you can't, the statistics of what you provide are going to be off enough that any strong model is going to be wrong in strange ways.
4 comments

This is an incredibly important point: in order for your synthetic data to be useful, your simulator must have already solved the problem at hand. In theory there is no need to even fool around with generating the synthetic data and going through the charade of training a model on it; simply extract the outcome model from your simulator directly, as that's implicitly what you are doing. For example, if you have a generative model that provides densities, you can simply compute P(Y | X) = P(X, Y) / P(X).
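To make the density argument concrete, here is a minimal sketch for the discrete case. The `joint` table and its numbers are made up for illustration; they stand in for a generative model that exposes P(X, Y) directly.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X, Y) over two binary variables,
# standing in for a generative model that exposes densities.
joint = {
    ("x0", "y0"): Fraction(3, 10),
    ("x0", "y1"): Fraction(1, 10),
    ("x1", "y0"): Fraction(2, 10),
    ("x1", "y1"): Fraction(4, 10),
}

def marginal_x(joint, x):
    """P(X = x), obtained by summing the joint over every value of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(joint, y, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    return joint[(x, y)] / marginal_x(joint, x)

print(conditional(joint, "y1", "x1"))  # (4/10) / (6/10) = 2/3
```

No model is trained here: the predictor is read straight off the joint, which is the sense in which the simulator has already solved the problem.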
But this is not how generators work. They generally produce samples in the form

G: Q -> (X,Y)

where Q is some prior from which you are sampling. If they are not invertible then you straight up cannot get P(X,Y) out of the generator. Even if it is invertible, getting P(X) requires integrating out Y, which might be infeasible (since the model is not analytically integrable and changes fast enough that you need very, very many samples).
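A rough sketch of what that looks like in practice. The `generator` below is a toy I made up, not any real system: it maps Gaussian noise to an (x, y) pair, it is not invertible, and it never returns a density. The only way to learn anything about P(Y | X) from it is to draw samples and count, here over a bin of x values.

```python
import random

def generator(rng):
    """Hypothetical sampler G: Q -> (X, Y). It returns a sample, never a density."""
    q1, q2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x = q1 * q1 + q2 * q2          # many (q1, q2) pairs give the same x: not invertible
    y = 1 if q1 > 0.5 else 0
    return x, y

def estimate_p_y_given_x_bin(rng, lo, hi, n_samples):
    """Monte Carlo estimate of P(Y = 1 | lo <= X < hi) by counting samples in the bin."""
    hits = in_bin = 0
    for _ in range(n_samples):
        x, y = generator(rng)
        if lo <= x < hi:
            in_bin += 1
            hits += y
    estimate = hits / in_bin if in_bin else None
    return estimate, in_bin

rng = random.Random(0)
for lo, hi in [(0.0, 0.25), (4.0, 100.0), (9.0, 9.5)]:
    est, n = estimate_p_y_given_x_bin(rng, lo, hi, 20000)
    print(f"bin [{lo}, {hi}): estimate={est}, samples in bin={n}")
```

The narrow bin far out in the tail receives only a small fraction of the draws, so its estimate is much noisier. That is the "very, very many samples" problem in miniature.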

Mathematically valid but misses the business problem :)
Very true. If you've solved the labeling/extraction problem using a means other than ML, you can use that means to generate synthetic data. The situation at my company is exactly this.

Say you use regular expressions to extract sensitive data from standardized, but numerously varied, form documents. The pieces of information extracted are very common classes of data: first name, last name, dates, physical locations.

During the extraction process you can save the complement of the extraction (the "leftovers") and insert generated data at the extraction points. Also, because you've extracted the actual sensitive data, you can exclude that from the set of values used for generation, if it's practical.
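A minimal sketch of that leftover-plus-insertion idea, not our actual pipeline. The regex, the form layout, and the replacement pools are all invented for illustration. The function keeps the text around each match, drops a generated value into each extraction point (skipping the real value), and records character offsets as labels.

```python
import random
import re

# Hypothetical pattern for one form layout; real forms would need many such patterns.
PATTERN = re.compile(
    r"Name: (?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)\. "
    r"Date of birth: (?P<dob>\d{4}-\d{2}-\d{2})\."
)

# Made-up replacement pools (assumptions for illustration only).
POOLS = {
    "first": ["Alice", "Bruno", "Chen", "Dana"],
    "last": ["Okafor", "Lindqvist", "Moreau", "Tanaka"],
    "dob": ["1970-01-15", "1985-06-30", "1992-11-02"],
}

def synthesize(document, rng):
    """Return (synthetic_text, labels), or None if the pattern does not match."""
    match = PATTERN.search(document)
    if match is None:
        return None
    real = match.groupdict()
    pieces, labels, cursor = [], [], 0
    # Walk the named groups in the order they appear in the document.
    for field in sorted(real, key=match.start):
        pieces.append(document[cursor:match.start(field)])  # the "leftovers"
        # Exclude the real extracted value from the candidates.
        candidates = [v for v in POOLS[field] if v != real[field]]
        value = rng.choice(candidates)
        start = sum(len(p) for p in pieces)
        pieces.append(value)
        labels.append((field, start, start + len(value)))
        cursor = match.end(field)
    pieces.append(document[cursor:])  # trailing leftovers
    return "".join(pieces), labels

doc = "Name: Alice Smith. Date of birth: 1970-01-15. Notes: none."
text, labels = synthesize(doc, random.Random(0))
print(text)
print(labels)
```

Each (field, start, end) triple is a ready-made training label for the synthetic document, which is what makes the output usable for supervised learning.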

Sometimes people get so caught up in the math and theory that they fail to see the practical solutions.

I agree that this is very tricky. I think the most interesting synthetic healthcare data generation I saw used causal inference (where SMEs can bake in a bunch of expert knowledge during skeleton construction) and then generated data by getting the weights on the edges from a smaller dataset. At the same time, it is very hard to ensure that your synthetic dataset actually reflects the real world. On one hand, SME knowledge might give extra oomph to synthetic data generation (as this knowledge is equivalent to some highly abstracted training), but on the other, if the "expert knowledge" is wrong then it's a recipe for disaster.
Fraud modeling, regulatory requirements, and economic data sharing (e.g. internal firm data) all represent potential use cases.
You can still pretrain on synthetic data and finetune on a smaller dataset of real data.
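A toy sketch of that workflow, assuming a 1-D logistic regression trained with plain SGD. Everything here is made up for illustration: the "simulator" puts the decision threshold at 0.5 while the "real" data has it at 0.0, so the synthetic set is plentiful but slightly wrong and the real set is correct but small.

```python
import math
import random

def sgd_logistic(data, w, b, lr, epochs, rng):
    """Plain SGD on 1-D logistic regression, starting from the given (w, b)."""
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def make_data(rng, n, threshold):
    """Toy labels: y = 1 when x exceeds the threshold."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, 1 if x > threshold else 0) for x in xs]

def accuracy(data, w, b):
    """Fraction of points whose predicted class matches the label."""
    return sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)

rng = random.Random(0)
synthetic = make_data(rng, 2000, threshold=0.5)   # plentiful, slightly off
real_small = make_data(rng, 40, threshold=0.0)    # scarce real data
real_test = make_data(rng, 2000, threshold=0.0)   # held-out real data

# Pretrain from scratch on synthetic data, then continue from those weights on real data.
w_pre, b_pre = sgd_logistic(synthetic, 0.0, 0.0, lr=0.1, epochs=5, rng=rng)
w_ft, b_ft = sgd_logistic(real_small, w_pre, b_pre, lr=0.05, epochs=20, rng=rng)

print("pretrained only:", accuracy(real_test, w_pre, b_pre))
print("after finetune: ", accuracy(real_test, w_ft, b_ft))
```

The only thing that makes the second call a finetune is that it starts from the pretrained weights instead of zeros. Whether it helps on a real problem depends on how far off the simulator is and how much real data you have.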