by Cynddl 3056 days ago

I'm highly dubious of the ability of synthetic data to accurately model datasets without introducing unexpected bias, especially when it comes to accounting for causality.

If you dig through the original paper, the conclusion is in line with that:

“For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”

So, on the tests they developed, the proposed method doesn't work 8 times out of 15…

5 comments

hyperbovine 3056 days ago

Agreed, seems suspect. If they are really able to learn the population-level distribution, then why even bother generating fake data? Just release that instead.

bunderbunder 3056 days ago

Well, just knowing a few distributions wouldn't be great for building machine learning models.

bunderbunder 3056 days ago

I'd like to read the paper before drawing such a conclusion. (the link to it seems to be broken)

"for 7 out of 15 comparisons, we found no significant difference" could mean all sorts of things. It could mean that 7 comparisons were perfect and 8 were complete garbage, as you suggest. Or it could mean that 7 comparisons were perfect and 8 had differences that were statistically significant, but the magnitudes of the differences were small enough that the results would still have been perfectly adequate for practical application.

In concrete terms: Let's say the synthetic data lets me build a binary classifier that helps with a business issue, and has F1 scores of about 0.8. But if I had access to the real data, I could have got F1 of around 0.85. In that case, I'd happily take the data. As someone who's trying to solve business problems, it would be downright irresponsible of me to reject something that's better than what I currently have on the grounds that it's still less than some unattainable ideal.
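To make the gap between statistical and practical significance concrete, here is a minimal sketch with made-up F1 scores. The numbers, the 0.05 tolerance and the helper name `welch_t` are illustrative assumptions, not taken from the paper. Twenty low-variance runs per dataset turn a 0.005 gap into a large test statistic, even though the gap is far below what would matter in practice.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

# Hypothetical F1 scores from 20 repeated runs on each dataset.
noise = [-0.002, -0.001, 0.0, 0.001, 0.002] * 4
f1_real = [0.850 + e for e in noise]
f1_synth = [0.845 + e for e in noise]

t_stat = welch_t(f1_real, f1_synth)
gap = statistics.mean(f1_real) - statistics.mean(f1_synth)

# Assumed business tolerance: a loss of up to 0.05 F1 is acceptable.
TOLERANCE = 0.05
statistically_significant = abs(t_stat) > 2.0  # rough rule of thumb, not an exact test
practically_acceptable = abs(gap) <= TOLERANCE

print(f"t = {t_stat:.1f}, gap = {gap:.3f}")
print("significant:", statistically_significant, "| acceptable:", practically_acceptable)
```

Both flags come out true here: the difference is detectable, yet small enough that the synthetic data would still be usable under the assumed tolerance.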

fardin1368 3056 days ago

You are ignoring the restrictions and regulations that exist around sharing data in lots of financial, government and medical industries. Sometimes the cost of losing 5 percent accuracy is much less than the cost of the inspections, delays and blockages that would otherwise occur if they wanted to use real data.

bunderbunder 3056 days ago

I misspoke; I should have said, "I'd happily take the synthetic data."

But yeah, you're right; I was being overly simplistic in thinking of it only in terms of "can have/can't have" and not considering the "can have, but at too high a cost" angle.

nonbel 3056 days ago

I couldn't read the paper (seemed to be missing), but has anyone else noticed that MIT seems to have big problems with open science?

I mean I have formed an association specifically with the MIT brand now, so this type of work coming out of there doesn't surprise me. I couldn't tell you exactly what has led to this association though.

malshe 3056 days ago

Just below what you reproduced, they write:

When we examined the confidence intervals for the remaining 8 tests, we found that for half, the mean of accuracies for features written over synthesized data was higher then for those written on the control dataset.

In other words, for 4 out of the remaining 8 cases, the models on synthetic data performed better.
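Putting the two quoted passages together, the tally can be sketched as below. The counts 7, 15 and "half" come from the quotes; treating the other half of the remaining 8 as cases where the control dataset did better is an assumption, since the quote does not say so explicitly.

```python
from fractions import Fraction

total = 15
no_significant_difference = 7                   # quoted from the paper
remaining = total - no_significant_difference   # tests with a significant difference
synthetic_higher = remaining // 2               # "for half", per the quote
control_higher = remaining - synthetic_higher   # assumption: the rest favored the control

# Comparisons where synthetic data was not measurably worse than the control.
not_worse = Fraction(no_significant_difference + synthetic_higher, total)
print(remaining, synthetic_higher, control_higher, not_worse)  # 8 4 4 11/15
```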

Cynddl 3056 days ago

Yes, I did leave that out, as I think it's still an issue. A synthetic model performing better is a little dubious, since the modeled distribution has less information than the original one. Overall, the discrepancy seems more important to notice than the actual performance.

stelfer 3056 days ago

Haven't read the paper, but I will.

But I want to comment that it's worked for us. Sequence-to-sequence learning can reproduce every kind of iid and non-iid data we've ever looked at.

The real question is how safe/anonymous is it really?

bunderbunder 3056 days ago

I imagine it depends on how closely you model the conditional probabilities.

If it gets down to correctly modeling the probability of colon cancer diagnosis by age, sex and ZIP code, and also the correct distribution of ages by ZIP code, then that'll be a potential problem in counties that only have one male 87-year-old.
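One rough way to spot this kind of exposure is a k-anonymity-style count over the quasi-identifiers. The sketch below is not from the paper; the records, the ZIP codes and the helper name `risky_groups` are all hypothetical. It flags any (age, sex, ZIP code) combination shared by fewer than `k` rows.

```python
from collections import Counter

# Hypothetical records: (age, sex, zip_code, diagnosis).
records = [
    (87, "M", "02139", "colon cancer"),
    (45, "F", "02139", "none"),
    (45, "F", "02139", "none"),
    (45, "F", "02139", "colon cancer"),
    (62, "M", "02140", "none"),
    (62, "M", "02140", "none"),
]

def risky_groups(rows, k=2):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter((age, sex, zip_code) for age, sex, zip_code, _ in rows)
    return {group: n for group, n in counts.items() if n < k}

unique = risky_groups(records)
print(unique)  # {(87, 'M', '02139'): 1}
```

The lone 87-year-old male is the only group of size one, so a generator that reproduces the conditional probabilities for that cell too faithfully would effectively be describing one real person.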

stelfer 3056 days ago

I'm talking specifically about modeling iid/non-iid sequences of data from events, experiments, etc. Haven't read the paper, so I'm not sure if I'm talking past the authors or OP.
