by sriku
3009 days ago
I haven't read the original paper (yet), but something doesn't sit right with the work, if the way it is portrayed is indeed faithful to it and I'm not missing something important.
- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it instead of using it to generate data?)
- Its "validation" appears to be doubly proxied, i.e. the normal performance measures we use are themselves a proxy, and now we're comparing those against performance measures derived from models built on the data generated by these models. I'm not inclined to trust a validation that is so far removed.
Can anyone explain this well?
Peeling back the mystery a bit, what is happening is:
1. From each child table upwards, model each column as a simple distribution (e.g. Gaussian), along with a covariance matrix across the columns.
2. Given those child table distribution parameters, pass them back as row values to their respective parent tables.
What you end up with is a "flattened" version of each parent table that has the information (in an information-theoretic sense) of all child relations. Sampling from the distributions is then straightforward. The stats methods are outlined in section 3 of the paper.
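To make the two steps concrete, here is a minimal sketch of the flattening idea in plain Python. It is not the paper's implementation: the table layout, the `flatten_children` helper, and the column names (`parent_id`, `amount`, `qty`) are all made up for illustration, and it only computes per-parent means and population covariances rather than fitting full distributions.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from statistics import mean


def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def flatten_children(parents, children, fk, columns):
    """Attach summary parameters of each parent's child rows to the parent row.

    Hypothetical helper: `fk` is the child column holding the parent id,
    `columns` are the numeric child columns to summarize.
    """
    groups = defaultdict(list)
    for row in children:
        groups[row[fk]].append(row)

    flattened = []
    for parent in parents:
        rows = groups.get(parent["id"], [])
        extended = dict(parent)
        extended["child_count"] = len(rows)
        if rows:
            # Step 1: per-column means plus every entry of the covariance matrix.
            for col in columns:
                extended[f"{col}_mean"] = mean(r[col] for r in rows)
            for a, b in combinations_with_replacement(columns, 2):
                extended[f"cov_{a}_{b}"] = covariance(
                    [r[a] for r in rows], [r[b] for r in rows]
                )
        # Step 2: the parameters now live on the parent row as ordinary values.
        flattened.append(extended)
    return flattened


# Toy data, invented for this example.
parents = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
children = [
    {"parent_id": 1, "amount": 10.0, "qty": 1.0},
    {"parent_id": 1, "amount": 20.0, "qty": 3.0},
    {"parent_id": 2, "amount": 5.0, "qty": 2.0},
]
flat = flatten_children(parents, children, "parent_id", ["amount", "qty"])
```

Each row of `flat` is a parent row extended with the parameters of its children, so the same procedure can be repeated one level up, treating the extended parent as a child of its own parent table.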
Things of note:
- The paper makes heavy use of Copula transformations to normalize data whenever it passes around the distribution parameters.
- It deals with missing values by adding something like a dummy column.
- The key insight is that columns must be represented by parameterized distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test is used to choose the "best fit" CDF to model.
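As a rough illustration of the distribution selection and the copula step, the sketch below uses only the Python standard library. It is a simplification and not the paper's code: it compares raw Kolmogorov-Smirnov statistics instead of running the full test, the three candidate distributions and all function names are assumptions made for this example, and it assumes a non-constant numeric column.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a candidate `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def candidate_cdfs(sample):
    """Fit a few simple candidates to the column (illustrative choices only)."""
    mu, sigma = mean(sample), pstdev(sample)
    lo, hi = min(sample), max(sample)
    return {
        "gaussian": NormalDist(mu, sigma).cdf,
        "uniform": lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
        # Exponential shifted to start at the column minimum.
        "exponential": lambda x: 1.0 - math.exp(-(x - lo) / (mu - lo)) if x >= lo else 0.0,
    }


def best_fit(sample):
    """Return (name, cdf) of the candidate with the smallest KS statistic."""
    cdfs = candidate_cdfs(sample)
    name = min(cdfs, key=lambda k: ks_statistic(sample, cdfs[k]))
    return name, cdfs[name]


def to_standard_normal(sample, cdf, eps=1e-6):
    """Gaussian-copula style transform: x -> u = F(x) -> z = Phi^{-1}(u)."""
    std = NormalDist()
    # Clamp u away from 0 and 1 so the inverse normal CDF stays finite.
    return [std.inv_cdf(min(1.0 - eps, max(eps, cdf(x)))) for x in sample]


# Two toy columns: an evenly spread one and a bell-shaped one.
grid_column = [(i + 0.5) / 100 for i in range(1000)]
rng = random.Random(0)
bell_column = [rng.gauss(50.0, 5.0) for _ in range(2000)]

grid_name, grid_cdf = best_fit(grid_column)
bell_name, bell_cdf = best_fit(bell_column)
bell_z = to_standard_normal(bell_column, bell_cdf)
```

After the transform every column lives on the same standard-normal scale regardless of its original shape, which is what makes a single covariance matrix across differently distributed columns meaningful.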
To your question about the role of the data scientists: they are using the resulting simulations to solve more complex tasks. The goal of the experiment was to see how well the sample data would perform against Kaggle competitions. So I guess the idea was that if winners were indistinguishable, the simple/hierarchical distributions would be considered robust enough for complex tasks. In the end, I'm sure shipping the underlying is preferable for consumers.