Just finished the paper, so let me take a stab. Peeling back the mystery a bit, here is what is happening:
1. Starting from each child table and working upwards, model each column as a simple distribution (e.g. Gaussian), together with a covariance matrix.
2. Take those child table distribution parameters and pass them back as row values to their respective parent tables.
What you end up with is a "flattened" version of each parent table that carries the information (in an "information theoretic" sense) of all child relations. Sampling from the distributions is straightforward. The stats methods are outlined in section 3 of the paper.
Things of note:
- The paper makes heavy use of copula transformations to normalize data whenever it passes around the distribution parameters.
- It deals with missing values by adding something like a dummy column.
- The key insight is that columns must be represented by parameterized distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test is used to choose the "best fit" CDF to model.
To your question about the role of the data scientists: they are using the resulting simulations to solve more complex tasks. The goal of the experiment was to see how well the sample data would perform against Kaggle competitions. So I guess the idea was that if winners were indistinguishable, the simple/hierarchical distributions would be considered robust enough for complex tasks. In the end, I'm sure shipping the underlying is preferable for consumers.
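For what it's worth, the column-fitting and copula steps can be sketched in a few lines. This is a rough illustration under my own assumptions, not the paper's code: the three candidate families, the moment-matching fits, the toy `age`/`wait` columns, and helper names like `best_fit` and `to_gaussian_space` are all hypothetical. It uses only the Python standard library.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF of `values` and a candidate CDF."""
    xs = sorted(values)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap

def candidate_cdfs(values):
    """Hypothetical candidate families, fitted by simple moment matching."""
    mu, sigma = fmean(values), pstdev(values)
    lo, hi = min(values), max(values)
    return {
        "gaussian": NormalDist(mu, sigma).cdf,
        "uniform": lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
        "exponential": lambda x: 1.0 - math.exp(-(x - lo) / (mu - lo)) if x > lo else 0.0,
    }

def best_fit(values):
    """Pick the candidate whose CDF has the smallest KS statistic."""
    fits = candidate_cdfs(values)
    name = min(fits, key=lambda k: ks_statistic(values, fits[k]))
    return name, fits[name]

def to_gaussian_space(values, cdf, eps=1e-6):
    """Copula step: x -> Phi^{-1}(F(x)), clipped so inv_cdf stays finite."""
    std = NormalDist()
    return [std.inv_cdf(min(1.0 - eps, max(eps, cdf(x)))) for x in values]

def covariance_matrix(columns):
    """Population covariance between every pair of columns."""
    means = [fmean(c) for c in columns]
    n = len(columns[0])
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / n
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]

# Toy table: a Gaussian column and an exponential column that are
# correlated (rho = 0.6) in the underlying Gaussian space.
rng = random.Random(0)
std = NormalDist()
rho = 0.6
age, wait = [], []
for _ in range(2000):
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    age.append(40 + 10 * z1)
    wait.append(-math.log(1.0 - std.cdf(z2)) * 2.0)

name_age, cdf_age = best_fit(age)
name_wait, cdf_wait = best_fit(wait)
cov = covariance_matrix([to_gaussian_space(age, cdf_age),
                         to_gaussian_space(wait, cdf_wait)])
print(name_age, name_wait, round(cov[0][1], 2))
```

The per-column family names plus `cov` are the "distribution parameters" that would get passed up to the parent table as extra row values. Sampling would run the same steps in reverse: draw correlated normals using `cov`, then push each coordinate through the standard normal CDF and the inverse of the chosen column CDF.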
Table modeling: While column distributions are picked using the KS test, the covariance matrix calculation first normalizes the column distributions. Assuming that is reasonable, there is a claim that "this model contains all the information about the original table in a compact way..", but it doesn't account for possible multi-dimensional relationships in the data. It only looks at a series of projections to 2D. Can a d-dimensional dataset (in practice) be effectively summarized by the set of projections onto the d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm unsure whether it is adequate for practical modeling work, especially if folks try to apply high-dimensional techniques (DL?) to this. (edit: I feel reasonably sure it isn't adequate. If a column ends up being bimodal, for example, even that gets lost in translation in this approach?)
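A toy case shows why pairwise summaries can fall short. This is my own illustration, not something from the paper: three binary columns where `z` is the XOR of `x` and `y`. Every pair of columns looks uncorrelated, so a model that only keeps marginals and pairwise covariances would treat `z` as independent noise. The `z_synth` column below is a hypothetical stand-in for what such a model would generate.

```python
import random

def pearson(a, b):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
n = 20000
x = [rng.randint(0, 1) for _ in range(n)]
y = [rng.randint(0, 1) for _ in range(n)]
z = [a ^ b for a, b in zip(x, y)]  # fully determined by (x, y) jointly

# All three pairwise correlations come out close to zero.
pairwise = {"xy": pearson(x, y), "xz": pearson(x, z), "yz": pearson(y, z)}

# Stand-in for a synthetic column that matches z's marginal and its
# (near-zero) pairwise correlations: an independent coin flip.
z_synth = [rng.randint(0, 1) for _ in range(n)]

# Fraction of rows where the three-way rule z == x XOR y holds.
real_rule = sum((a ^ b) == c for a, b, c in zip(x, y, z)) / n
synth_rule = sum((a ^ b) == c for a, b, c in zip(x, y, z_synth)) / n

print(pairwise)
print(real_rule, synth_rule)
```

The rule holds on every real row and on only about half of the synthetic rows, even though the two datasets agree on every 2D projection. Real tables are rarely this adversarial, so this shows that the loss is possible, not how often it matters in practice.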
Crowdsourced validations: The synthetic sets were generated for already available public datasets. It isn't clear from the paper how any bias resulting from prior familiarity with the public datasets would be accounted for in the study concluding equivalence.
Privacy claims: This is a bit unclear. The "apply random noise" technique seems to suggest something similar to differential privacy, but the paper makes no mention of it. If not DP, what definition of "privacy" is being used here? (I'm OK with the idea that proving their algorithm privacy-safe under a chosen definition of privacy may be out of scope for the paper.)
(Edit2: I can't shake the feeling that this paper is an elaborate April Fools' joke released early ;)