Hacker News
by Cynddl 3009 days ago
I'm highly dubious of the ability of synthetic data to accurately model datasets without introducing unexpected bias, esp. when it comes to accounting for causality.

If you dig through the original paper, the conclusion is in line with that:

“For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”

So, on the tests they developed, the proposed method doesn't work 8 times out of 15…

5 comments

Agreed, seems suspect. If they are really able to learn the population-level distribution, then why even bother generating fake data? Just release that instead.
Well, just knowing a few distributions wouldn't be great for building machine learning models.
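To make that point concrete, here is a minimal sketch on a made-up toy dataset (nothing here comes from the paper). Two datasets share exactly the same per-column distributions, but only one of them preserves the relationship a model would need to learn. The XOR label and the function names are assumptions chosen purely for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical "real" data: two binary features (a, b) and a label y = a XOR b.
real = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 2

# What you would get by sampling every column independently from its own
# marginal distribution: each (a, b, y) combination shows up equally often.
independent = list(product([0, 1], repeat=3))

def marginals(rows):
    """Per-column value counts, i.e. 'a few distributions'."""
    return [Counter(column) for column in zip(*rows)]

def xor_rule_accuracy(rows):
    """Fraction of rows where the label really is a XOR b."""
    return sum((a ^ b) == y for a, b, y in rows) / len(rows)

print(marginals(real) == marginals(independent))  # identical marginals
print(xor_rule_accuracy(real))                    # relationship intact
print(xor_rule_accuracy(independent))             # relationship gone
```

The marginals match exactly, yet the label is perfectly predictable in one dataset and a coin flip in the other, which is why releasing a handful of distributions is not a substitute for row-level data.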
I'd like to read the paper before drawing such a conclusion. (the link to it seems to be broken)

"for 7 out of 15 comparisons, we found no significant difference" could mean all sorts of things. It could mean that 7 comparisons were perfect and 8 were complete garbage, as you suggest. Or it could mean that 7 comparisons were perfect and 8 had differences that were statistically significant, but the magnitudes of the differences were small enough that the results would still have been perfectly adequate for practical application.

In concrete terms: Let's say the synthetic data lets me build a binary classifier that helps with a business issue, and has F1 scores of about 0.8. But if I had access to the real data, I could have got F1 of around 0.85. In that case, I'd happily take the data. As someone who's trying to solve business problems, it would be downright irresponsible of me to reject something that's better than what I currently have on the grounds that it's still less than some unattainable ideal.
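A rough sketch of the "significant but small" case, using invented accuracy numbers (not the paper's results) and a hand-rolled Welch t statistic from the standard library. The lists `control` and `synthetic` and the helper `welch_t` are hypothetical names for illustration only.

```python
import math
import statistics

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    standard_error = math.sqrt(var_x / len(xs) + var_y / len(ys))
    return (mean_x - mean_y) / standard_error

# Hypothetical accuracies from repeated runs; tight spread, small gap.
control = [0.850, 0.852, 0.848, 0.851, 0.849, 0.850, 0.853, 0.847]
synthetic = [0.840, 0.842, 0.838, 0.841, 0.839, 0.840, 0.843, 0.837]

gap = statistics.mean(control) - statistics.mean(synthetic)
t = welch_t(control, synthetic)

print(f"mean accuracy gap: {gap:.3f}")
print(f"Welch t statistic: {t:.1f}")
```

With these made-up numbers the t statistic comes out around 10, so any reasonable test would call the difference significant, even though the gap is only about one point of accuracy. "Significant difference" on its own says nothing about whether the difference matters in practice.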

You are ignoring the restrictions and regulations that exist around sharing data in lots of financial, government and medical industries. Sometimes, the cost of losing 5 percent accuracy is much less than the cost of the inspections, delays and blockages that would otherwise occur if they wanted to use real data.
I misspoke; I should have said, "I'd happily take the synthetic data."

But yeah, you're right; I was being oversimplistic in just thinking of it in terms of "can have/can't have" and not considering the "can have, but at too high a cost" angle.

I couldn't read the paper (seemed to be missing), but has anyone else noticed that MIT seems to have big problems with open science?

I mean I have formed an association specifically with the MIT brand now, so this type of work coming out of there doesn't surprise me. I couldn't tell you exactly what has led to this association though.

Just below what you reproduced, they write:

“When we examined the confidence intervals for the remaining 8 tests, we found that for half, the mean of accuracies for features written over synthesized data was higher than for those written on the control dataset.”

In other words, for 4 out of the remaining 8 cases, the models on synthetic data performed better.

Yes, I did leave that out, as I think it's still an issue. A model built on synthetic data performing better is a little dubious, since the modeled distribution has less information than the original one. Overall, the discrepancy seems more important to notice than the actual performance.
Haven't read the paper, but I will.

But I want to comment that it's worked for us. Sequence to sequence learning can reproduce every kind of iid and non-iid data we've ever looked at.

The real question is how safe/anonymous is it really?

I imagine it depends on how closely you model the conditional probabilities.

If it gets down to correctly modeling the probability of colon cancer diagnosis by age, sex and ZIP code, and also the correct distribution of ages by ZIP code, then that'll be a potential problem in counties that only have one male 87-year-old.
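One simple way to check for that kind of exposure is to count how many records share each combination of quasi-identifiers, in the spirit of k-anonymity. This is a minimal sketch on invented records; the ZIP codes, the field order (age, sex, ZIP) and the function names are all made up for illustration.

```python
from collections import Counter

# Hypothetical records: (age, sex, zip_code).
records = [
    (87, "M", "00001"),
    (34, "F", "00001"),
    (34, "F", "00001"),
    (52, "M", "00002"),
    (52, "M", "00002"),
    (52, "M", "00002"),
]

def smallest_group(rows):
    """Size of the rarest (age, sex, zip) combination, i.e. the k in k-anonymity."""
    return min(Counter(rows).values())

def singled_out(rows):
    """Combinations that match exactly one record."""
    return [key for key, count in Counter(rows).items() if count == 1]

print(smallest_group(records))
print(singled_out(records))
```

If the generator reproduces the conditional distributions faithfully enough that a group of size one survives, anything it reveals about that group is effectively a statement about one person.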

I'm talking specifically about modeling iid/non-iid sequences of data from events, experiments, etc. Haven't read the paper, so I'm not sure if I'm talking past the authors or OP.