by sibmike 3009 days ago
Correct me if I am wrong.

As you note, the Kolmogorov-Smirnov test is used to choose the "best fit" CDFs. The set of CDFs is then used to generate a random vector, which after a covariance adjustment becomes a synthetic datapoint.
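
To make sure I am reading the pipeline right, here is a rough sketch of how I understand it, using only the Python standard library. The candidate families (normal, uniform, exponential), the helper names, and the fixed correlation `rho` are my own assumptions for illustration, not taken from the actual implementation; the covariance adjustment is sketched here as correlated normal draws pushed through each column's inverse CDF.

```python
import math
import random
from statistics import NormalDist, mean

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

def candidate_fits(sample):
    """Hypothetical candidate families: name -> (cdf, inverse cdf)."""
    normal = NormalDist.from_samples(sample)
    lo, hi = min(sample), max(sample)
    fits = {
        "normal": (normal.cdf, normal.inv_cdf),
        "uniform": (lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
                    lambda u: lo + u * (hi - lo)),
    }
    if lo >= 0 and mean(sample) > 0:  # exponential only makes sense for non-negative data
        rate = 1.0 / mean(sample)
        fits["exponential"] = (lambda x: 1.0 - math.exp(-rate * x),
                               lambda u: -math.log(1.0 - u) / rate)
    return fits

def best_fit(sample):
    """Pick the candidate with the smallest KS statistic."""
    fits = candidate_fits(sample)
    name = min(fits, key=lambda k: ks_statistic(sample, fits[k][0]))
    return name, fits[name][1]

def synthetic_pair(inv_a, inv_b, rho, rng):
    """Draw two correlated normals, map them to uniforms, then through each inverse CDF."""
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    std = NormalDist()
    clamp = lambda u: min(max(u, 1e-12), 1 - 1e-12)  # keep inverse CDFs finite
    return inv_a(clamp(std.cdf(z1))), inv_b(clamp(std.cdf(z2)))

rng = random.Random(0)
ages = [rng.gauss(40, 10) for _ in range(2000)]        # made-up column A
waits = [rng.expovariate(0.5) for _ in range(2000)]    # made-up column B
name_a, inv_a = best_fit(ages)
name_b, inv_b = best_fit(waits)
point = synthetic_pair(inv_a, inv_b, rho=0.6, rng=rng)  # rho is an assumed value
print(name_a, name_b, point)
```

In a real pipeline `rho` would presumably be estimated from the data rather than hard-coded.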

The step that can ruin the synthetic data is exactly that choice of "best fit" CDFs, as the original distribution does not necessarily fit any of the well-known distributions well.
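
A quick way to see the problem, as a sketch on made-up data: fit a normal distribution to a bimodal column and compare its KS statistic with that of a genuinely normal column. The data and names here are hypothetical.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

rng = random.Random(1)
unimodal = [rng.gauss(5, 1) for _ in range(2000)]
# Two well-separated clusters: no single bell curve describes this column.
bimodal = [rng.gauss(0, 1) if rng.random() < 0.5 else rng.gauss(10, 1)
           for _ in range(2000)]

ks_unimodal = ks_statistic(unimodal, NormalDist.from_samples(unimodal).cdf)
ks_bimodal = ks_statistic(bimodal, NormalDist.from_samples(bimodal).cdf)
print(f"unimodal: {ks_unimodal:.3f}  bimodal: {ks_bimodal:.3f}")
```

The fitted normal would still be usable as a candidate for the bimodal column, but synthetic values drawn from it would pile up in the gap between the two clusters, where the original data has almost nothing.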

At the same time, the "best fit" CDFs are responsible for anonymizing the results. So if you overfit and stick to the original data too closely, you lose anonymity and capture the original data bias. But if you approximate with a well-known distribution, you introduce a distribution bias.

So the solution provides a tradeoff between anonymity and "best fit" corruption of the data.
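
The two ends of that tradeoff can be sketched on a toy column. This assumes the most extreme form of overfitting, resampling the empirical distribution itself, and a plain normal fit at the other end; the salary figures and the `leak_rate` helper are made up for illustration.

```python
import random
from statistics import NormalDist

rng = random.Random(2)
original = [rng.gauss(50_000, 12_000) for _ in range(500)]  # hypothetical salaries
seen = set(original)

# Extreme overfit: sample straight from the empirical distribution.
overfit = [rng.choice(original) for _ in range(500)]

# Parametric approximation: sample from a fitted normal instead.
fitted = NormalDist.from_samples(original)
smooth = fitted.samples(500, seed=3)

def leak_rate(synthetic):
    """Fraction of synthetic values that are exact copies of an original value."""
    return sum(x in seen for x in synthetic) / len(synthetic)

print(leak_rate(overfit), leak_rate(smooth))
```

Every overfit value is a verbatim original record, while the fitted samples copy none but only look like the original data to the extent that the normal shape is right.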