by sriku
3008 days ago
(Going through the paper .. a few questions/notes)

Table modeling: While column distributions are picked using the KS-test, the covariance matrix calculation first normalizes the column distributions. Assuming that is reasonable, there is a claim of "this model contains all the information about the original table in a compact way..", but it doesn't account for possible multi-dimensional relationships in the data. It only looks at a series of projections to 2D. Can a d-dimensional dataset (in practice) be effectively summarized by the set of projections onto the d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm unsure whether it is adequate for practical modeling work, especially if folks try to apply high-dimensional techniques (DL?) to this. (Edit: I feel reasonably sure it isn't adequate. If a column ends up being bi-modal, for example, even that gets lost in translation in this approach?)

Crowdsourced validations: The synthetic sets were generated for already available public datasets. It isn't clear from the paper how any bias resulting from prior familiarity with the public datasets would be accounted for in the study concluding equivalence.

Privacy claims: This is a bit unclear. The "apply random noise" technique seems to suggest something similar to differential privacy, but the paper makes no mention of it. If not DP, what definition of "privacy" is being used here? (I'm ok that proving their algorithm to be privacy safe according to a chosen definition of privacy may be out of scope of the paper.)

(Edit2: I can't help feeling that this paper is an elaborate April fool's joke released early ;)
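To make the 2D-projection worry concrete, here is a toy sketch (my own construction, not taken from the paper): two independent ±1 columns `x` and `y`, plus a third column `z = x * y`. Every pair of columns is uncorrelated (in fact pairwise independent), yet `z` is fully determined by the other two, so nothing in the pairwise summaries reveals the three-way relationship. The column names, the sample size, and the `pearson` helper are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000

# Two independent columns taking values -1 or +1 with equal probability.
x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
y = [rng.choice((-1.0, 1.0)) for _ in range(n)]
# Third column is a deterministic function of the first two.
z = [a * b for a, b in zip(x, y)]

# All pairwise correlations are close to zero.
pairwise = {"xy": pearson(x, y), "xz": pearson(x, z), "yz": pearson(y, z)}

# The three-way moment E[x*y*z] exposes the dependence: it is exactly 1.
three_way = sum(a * b * c for a, b, c in zip(x, y, z)) / n

print(pairwise)
print(three_way)
```

A model built only from marginals plus a covariance matrix would resample these three columns as if they were independent, so the rule `z = x * y` would hold for only about half of the synthetic rows.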
Just thinking out loud here:
The typical case where a low-dimensional representation would fail you is if you had dependencies (e.g. bimodal relations) that weren't represented by a datatype or foreign key. Recall that the simulation of data still occurs within each table, so the higher the non-represented inter-table dimensionality is, the less the supplied distributions can measure it. It might be that, for the most part, the raw columns (not from child tables) have much more bearing on the merit of the table covariance. This seems natural, due to the semantic nature of RDBMS structures.
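As a rough illustration of the bimodal case, the sketch below assumes a column is summarized by a single Gaussian (mean and standard deviation only). That is an assumption for the example, not necessarily what the paper's KS-test selection would pick. The mixture parameters and the variable names are hypothetical.

```python
import random
from math import erf, sqrt
from statistics import fmean, pstdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 20000

# Hypothetical bimodal column: equal mixture of N(-3, 1) and N(3, 1).
col = [rng.gauss(rng.choice((-3.0, 3.0)), 1.0) for _ in range(n)]

# Summarize the column with a single Gaussian (the assumed model).
mu, sigma = fmean(col), pstdev(col)


def normal_cdf(v):
    """CDF of the fitted Gaussian N(mu, sigma^2) at v."""
    return 0.5 * (1.0 + erf((v - mu) / (sigma * sqrt(2.0))))


# Share of rows that actually fall in the valley between the two modes.
observed = sum(1 for v in col if abs(v) < 1.0) / n
# Share the fitted Gaussian expects in that same interval.
predicted = normal_cdf(1.0) - normal_cdf(-1.0)

print(observed, predicted)
```

The fitted Gaussian puts a large share of its mass in the valley between the modes, where the real column has almost none, so rows sampled from it would mostly look unlike any real row.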
It's probably an important caveat that typical RDBMS structures are created to optimize the user's understanding of the data through semantic structure. Since the claim of the paper was only that they could provide a useful abstraction for simulation, I think it's OK to proceed with the assumption that Gaussians can never be fully sufficient in modeling high-dimensional data without help.
There are existing non-parametric models that attempt to do a similar thing for relational data [1] that I think are more promising. One drawback of current solutions like BayesDB [2] is that you're still dealing with the original table structure, which this paper tries to get around. It would be nice to bridge the gap for something like PyMC3 [3], where we find a cute way to flatten the data, as this paper does.
[1] Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes. https://arxiv.org/pdf/1704.01087.pdf
[2] BayesDB. http://probcomp.csail.mit.edu/bayesdb/
[3] PyMC3 documentation. http://docs.pymc.io/index.html