by sriku 3009 days ago
I haven't read the original paper (yet), but something doesn't sit right with the work, if the way it is portrayed is indeed faithful to it and I'm not missing something important.

- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it instead of using it to generate data?)

- Its "validation" appears to be doubly proxied - i.e. the normal performance measures we use are themselves a proxy, and now we're comparing those against these performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so removed.

Can anyone explain this well?

Just finished the paper, so let me take a stab:

Peeling back the mystery a bit, what is happening is:

1. From each child table upwards, model each column as a simple distribution (e.g. Gaussian) and covariance matrix.

2. Given those child table distribution parameters, pass them back as row values to their respective parent tables.

What you end up with is a "flattened" version of each parent table that has the information (in an "information theoretic" sense) of all child relations. Sampling from distributions is straightforward. The stats methods are outlined in section 3 of the paper.
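To make the flattening step concrete, here is a minimal sketch of how I read it. The tables, column names, and the `flatten` helper are all made up for illustration, and it only passes up a count, mean, and standard deviation for a single child column. The paper also passes covariances and applies a copula first, which this leaves out.

```python
from statistics import mean, pstdev

# Hypothetical toy schema: one parent table (users) and one child table (orders).
users = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 51},
    {"user_id": 3, "age": 27},  # has no orders at all
]
orders = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": 14.0},
    {"user_id": 2, "amount": 100.0},
    {"user_id": 2, "amount": 120.0},
    {"user_id": 2, "amount": 110.0},
]


def flatten(parents, children, key, column):
    """Append per-parent distribution parameters of a child column to each parent row."""
    extended = []
    for parent in parents:
        values = [c[column] for c in children if c[key] == parent[key]]
        row = dict(parent)
        row[f"{column}_count"] = len(values)
        # Parents with no children get None here; a real pipeline would need
        # an explicit missing-value treatment for these cells.
        row[f"{column}_mean"] = mean(values) if values else None
        row[f"{column}_std"] = pstdev(values) if values else None
        extended.append(row)
    return extended


flat_users = flatten(users, orders, "user_id", "amount")
for row in flat_users:
    print(row)
```

Each parent row now carries a summary of its children, so the extended parent table can be modeled as one flat table in the same way as any other.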

Things of note:

- The paper makes heavy use of Copula transformations to normalize data whenever it passes around the distribution parameters.

- It deals with missing values by adding something like a dummy column.

- The key insight is that columns must be represented by parameterized distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test is used to choose the "best fit" CDF to model.
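A rough sketch of the "best fit" selection in the last bullet, using only the standard library. The candidate set (Gaussian, uniform, exponential) and the method-of-moments style fits are my assumptions, not the paper's list, and the exponential candidate assumes positive data.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before the step at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def best_fit(sample):
    """Return the name of the candidate with the smallest KS statistic, plus all scores."""
    mu, sigma = mean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    candidates = {
        "gaussian": NormalDist(mu, sigma).cdf,
        "uniform": lambda x: (x - lo) / (hi - lo),
        # Assumes positive data, so that mu > 0.
        "exponential": lambda x: 1.0 - math.exp(-x / mu) if x > 0 else 0.0,
    }
    scores = {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}
    return min(scores, key=scores.get), scores


rng = random.Random(0)
exp_sample = [rng.expovariate(0.5) for _ in range(2000)]
gauss_sample = [rng.gauss(10.0, 2.0) for _ in range(2000)]

print(best_fit(exp_sample)[0])
print(best_fit(gauss_sample)[0])
```

Note that the winner is only the best of the candidates offered. A column whose shape matches none of them still gets assigned one.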

To your question about the role of the data scientists: they are using the resulting simulations to solve more complex tasks. The goal of the experiment was to see how well the sample data would perform against Kaggle competitions. So I guess the idea was that if winners were indistinguishable, the simple/hierarchical distributions would be considered robust enough for complex tasks. In the end, I'm sure shipping the underlying model is preferable for consumers.

(Going through the paper .. a few questions/notes)

Table modeling: While column distributions are picked using the KS-test, the covariance matrix calculation first normalizes the column distributions. Assuming that is reasonable, there is a claim of "this model contains all the information about the original table in a compact way..", but it doesn't account for possible multi-dimensional relationships in the data. It only looks at a series of projections to 2D. Can a d-dimensional dataset (in practice) be effectively summarized by the set of projections onto the d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm unsure whether that is adequate for practical modeling work, especially if folks try to apply high dimensional techniques (DL?) to this. (edit: I feel reasonably sure it isn't adequate. If a column ends up being bi-modal, for example, even that gets lost in translation in this approach?)
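A small constructed example of the concern about 2D projections (my own toy table, not from the paper): three binary columns where z = x XOR y. Every pairwise covariance is exactly zero, yet the three columns are fully dependent jointly. Independent draws stand in here for a model that only knows marginals and pairwise covariances.

```python
import random
from itertools import combinations, product

# Toy table (an assumption for illustration): x and y are fair bits, z = x XOR y.
rows = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]


def covariance(table, i, j):
    """Population covariance between columns i and j."""
    n = len(table)
    mean_i = sum(r[i] for r in table) / n
    mean_j = sum(r[j] for r in table) / n
    return sum((r[i] - mean_i) * (r[j] - mean_j) for r in table) / n


pairwise_cov = {
    (i, j): covariance(rows, i, j) for i, j in combinations(range(3), 2)
}

# With zero pairwise covariance, a pairwise-only model has nothing to go on
# except the marginals, so it draws the three bits independently.
rng = random.Random(0)
synthetic = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(10_000)]


def xor_rate(table):
    """Fraction of rows that satisfy the three-way rule z == x XOR y."""
    return sum(1 for x, y, z in table if z == x ^ y) / len(table)


print(pairwise_cov)
print(xor_rate(rows), xor_rate(synthetic))
```

The original rows always satisfy the rule, while roughly half of the synthetic rows break it, even though every pairwise summary matches.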

Crowdsourced validations: The synthetic sets were generated for already available public datasets. It isn't clear from the paper how any bias resulting from prior familiarity with the public datasets would be accounted for in the study concluding equivalence.

Privacy claims: This is a bit unclear. The "apply random noise" technique seems to suggest something similar to differential privacy, but makes no mention of it. If not DP, what definition of "privacy" is being used here? (I'm ok that proving their algorithm to be privacy safe according to a chosen definition of privacy may be out of scope of the paper.)

(Edit2: I can't help the feeling I have that this paper is an elaborate April fool's joke released early ;)

To the first point, the paper mentions that "the covariance is calculated after applying the Gaussian Copula to that table". The experiments seem to conclude that, for their datasets, the 2D projections seem to work alright. I think that the surprising conclusion is that this works so well for any dataset at all.

Just thinking out loud here:

The typical case where a low dimensional representation would fail you is if you had dependencies (e.g. bimodal relations) that weren't represented by a datatype or foreign key. Recall that the simulation of data still occurs within each table, so the higher the non-represented inter-table dimensionality is, the less the supplied distributions can measure it. It might be that, for the most part, the raw columns (not from child tables) have much more bearing on the merit of the table covariance. This seems natural, due to the semantic nature of RDBMS structures.

It's probably an important caveat that typical RDBMS structures are created to optimize the user's understanding of the data through semantic structure. Since the claim of the paper was only that they could provide a useful abstraction for simulation, I think it's OK to proceed with the assumption that Gaussians can never be fully sufficient in modeling highly dimensional data without help.

There are existing non-parametric models [1] that attempt to do a similar thing for relational data that I think are more promising. One drawback of current solutions like BayesDB [2] is that you're still dealing with the original table structure, which this paper tries to get around. It would be nice to bridge the gap for something like PyMC3 [3] where we find a cute way to flatten the data, like this paper.

[1] Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes. https://arxiv.org/pdf/1704.01087.pdf

[2] http://probcomp.csail.mit.edu/bayesdb/

[3] http://docs.pymc.io/index.html

I think they just invented the political representative in modelling
Correct me if I am wrong.

As you note, the Kolmogorov-Smirnov test is used to choose the "best fit" CDFs. The set of CDFs is then used to generate a random vector, which after a covariance adjustment becomes a synthetic datapoint.
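For reference, here is one common way to sample from a Gaussian copula with two columns, as a sketch only. The column names, marginals (a Gaussian "age", an exponential "spend"), and parameter values are invented, and the paper's exact order of operations may differ from this.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def clamp(u, eps=1e-12):
    """Keep u strictly inside (0, 1) so the inverse CDFs stay finite."""
    return min(max(u, eps), 1.0 - eps)


def sample_row(rng, rho, age_mu, age_sigma, spend_mean):
    """Draw one synthetic (age, spend) row from a two-column Gaussian copula."""
    # 1. Correlated standard normals (2x2 Cholesky factor written out by hand).
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    # 2. Map to uniforms with the standard normal CDF.
    u1 = clamp(STD_NORMAL.cdf(z1))
    u2 = clamp(STD_NORMAL.cdf(z2))
    # 3. Push each uniform through the inverse of that column's fitted CDF.
    age = NormalDist(age_mu, age_sigma).inv_cdf(u1)
    spend = -spend_mean * math.log(1.0 - u2)  # inverse exponential CDF
    return age, spend


rng = random.Random(42)
rows = [sample_row(rng, 0.8, 40.0, 10.0, 2.0) for _ in range(5000)]

ages = [a for a, _ in rows]
spends = [s for _, s in rows]
print(mean(ages), mean(spends))
```

The marginals come entirely from the chosen CDFs and the dependence comes entirely from `rho`, which is why a poor CDF choice carries straight through to the synthetic rows.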

The step that can ruin the synthetic data is exactly this one (choosing the "best fit" CDFs), since the original distribution does not necessarily fit any of the well-known distributions well.

At the same time, the "best fit" CDFs are responsible for anonymizing the results. So if you overfit and stick too closely to the original data, you lose anonymity and capture the original data bias. But if you approximate with a distribution, you introduce a distribution bias.

So the solution provides a tradeoff between anonymity and "best fit" corruption of the data.