
To the first point, the paper mentions that "the covariance is calculated after applying the Gaussian Copula to that table". The experiments suggest that, for their datasets, the 2D projections work reasonably well. What I find surprising is that this works so well for any dataset at all.
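For anyone who wants to see that step concretely, here is a minimal sketch of "copula first, covariance second". It is my own toy version, not the paper's code: the function names, the rank-based empirical CDF, and the synthetic columns are all assumptions, and ties are not handled.

```python
import random
from statistics import NormalDist

def gaussian_copula_transform(column):
    """Map a column to standard-normal scores via its empirical CDF (ranks).

    Assumption: values are distinct; ties would need an averaging rule.
    """
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the CDF value strictly inside (0, 1)
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores

def covariance_matrix(columns):
    """Sample covariance between every pair of equally long columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        m = sum(col) / n
        centered.append([v - m for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(500)]   # skewed, clearly non-Gaussian
y = [v ** 2 + rng.gauss(0, 0.05) for v in x]     # non-linear, mostly monotone in x

# Covariance is computed on the transformed columns, not the raw ones
cov = covariance_matrix([gaussian_copula_transform(x),
                         gaussian_copula_transform(y)])
```

The point of the transform is that each column's marginal shape is factored out before the covariance is taken, so the matrix only has to describe how the columns move together.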

Just thinking out loud here:

The typical case where a low-dimensional representation would fail you is if you had dependencies (e.g. bimodal relations) that weren't represented by a datatype or foreign key. Recall that the simulation of data still occurs within each table, so the higher the non-represented inter-table dimensionality is, the less the supplied distributions can capture it. It might be that, for the most part, the raw columns (those not derived from child tables) have much more bearing on the merit of the table covariance. This seems natural, given the semantic nature of RDBMS structures.
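As a toy illustration of the failure mode I have in mind (my own made-up example, not one of the paper's datasets): a two-mode relation where `y` is either `+x` or `-x`. The two columns are tightly linked, but a covariance-style summary sees almost nothing.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation, i.e. normalized covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
x = [rng.uniform(1.0, 2.0) for _ in range(5000)]
# Hidden two-mode relation: the sign is a coin flip that no column records
y = [v if rng.random() < 0.5 else -v for v in x]

r_raw = pearson(x, y)                     # close to 0: the link is invisible
r_abs = pearson(x, [abs(v) for v in y])   # essentially 1: |y| equals x
```

If the sign were stored as its own column (or implied by a key), the structure would be recoverable; when it isn't, the second-order summary has nothing to hang it on.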

It's probably an important caveat that typical RDBMS structures are created to optimize the user's understanding of the data through semantic structure. Since the paper only claims to provide a useful abstraction for simulation, I think it's OK to proceed on the assumption that Gaussians can never be fully sufficient for modeling high-dimensional data without help.

There are existing non-parametric models that attempt a similar thing for relational data [1], which I think are more promising. One drawback of current solutions like BayesDB [2] is that you're still dealing with the original table structure, which this paper tries to get around. It would be nice to bridge the gap for something like PyMC3 [3] by finding a cute way to flatten the data, as this paper does.

[1] Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes. https://arxiv.org/pdf/1704.01087.pdf

[2] http://probcomp.csail.mit.edu/bayesdb/

[3] http://docs.pymc.io/index.html