by bunderbunder 3009 days ago
I'd like to read the paper before drawing such a conclusion. (The link to it seems to be broken.)

"for 7 out of 15 comparisons, we found no significant difference" could mean all sorts of things. It could mean that 7 comparisons were perfect and 8 were complete garbage, as you suggest. Or it could mean that 7 comparisons were perfect and 8 had differences that were statistically significant, but the magnitudes of the differences were small enough that the results would still have been perfectly adequate for practical application.

In concrete terms: Let's say the synthetic data lets me build a binary classifier that helps with a business issue, and it has F1 scores of about 0.8. But if I had access to the real data, I could have got an F1 of around 0.85. In that case, I'd happily take the data. As someone who's trying to solve business problems, it would be downright irresponsible of me to reject something that's better than what I currently have on the grounds that it's still less than some unattainable ideal.
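
To make the "statistically significant but practically small" case concrete, here is a minimal sketch of a two-proportion z-test. All the numbers are made up for illustration (they are not from the paper), and plain accuracy is used as a stand-in for F1 because the test applies directly to proportions.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for the difference between two accuracies."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Hypothetical counts: model trained on real data vs. on synthetic data,
# each evaluated on 100,000 held-out examples.
diff, z, p_value = two_proportion_z_test(85_000, 100_000, 84_000, 100_000)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p_value:.2e}")
```

With a test set this large, a one-point gap in accuracy comes out far below the usual 0.05 threshold, yet it may not matter at all for the business problem. The p-value only says the gap is unlikely to be noise; it says nothing about whether the gap is big enough to care about.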

1 comment

You are ignoring the restrictions and regulations around sharing data in many financial, government and medical settings. Sometimes the cost of giving up 5 percent accuracy is much lower than the cost of the inspections, delays and blockages that would occur if they wanted to use real data.
I misspoke; I should have said, "I'd happily take the synthetic data."

But yeah, you're right; I was being oversimplistic in just thinking of it in terms of "can have/can't have" and not considering the "can have, but at too high a cost" angle.