Hacker News
by leecarraher 1553 days ago
For synthetic data generation, what methods are they using to sample data from the distribution? What assumptions about the distribution are being made? Does it model correlations between sample attributes that could adversely affect some ML methods (multicollinearity can cause problems)?
1 comment

We developed our own generative model for synthetic data generation. It is an autoregressive model in which each variable/attribute is derived from the previously generated ones using Transformer networks (more details here: https://arxiv.org/pdf/2202.02145.pdf). So yes, correlations are modelled, although exact multicollinearity (an exact linear relationship among a set of attributes) would be a bit blurry in the synthetic data: the relationship is reproduced only approximately.
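To make the autoregressive idea concrete, here is a toy sketch of the factorization p(x1) p(x2 | x1) p(x3 | x1, x2) and so on. It is not our model: it swaps the Transformer for a linear-Gaussian conditional per column, and the names `fit_autoregressive` and `sample_autoregressive` are made up for this example. It only shows why generating each attribute from the previous ones carries correlations over to the synthetic data.

```python
import numpy as np

def fit_autoregressive(X):
    """Fit one linear-Gaussian conditional per column, given the columns before it.

    Toy stand-in for a learned conditional model (assumption: NOT a Transformer).
    Returns a list of (coefficients, residual_std), one entry per column.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    params = []
    for j in range(p):
        # Design matrix: intercept plus all previously generated columns.
        A = np.column_stack([np.ones(n), X[:, :j]])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        params.append((coef, resid.std()))
    return params

def sample_autoregressive(params, n, rng):
    """Sample rows column by column, each column conditioned on the earlier ones."""
    out = np.empty((n, len(params)))
    for j, (coef, sigma) in enumerate(params):
        A = np.column_stack([np.ones(n), out[:, :j]])
        out[:, j] = A @ coef + rng.normal(scale=sigma, size=n)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=5000)  # correlated with x1
real = np.column_stack([x1, x2])

synthetic = sample_autoregressive(fit_autoregressive(real), 5000, rng)
corr_real = np.corrcoef(real, rowvar=False)[0, 1]
corr_synth = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(corr_real, corr_synth)
```

The two printed correlations should be close to each other. A more flexible conditional model plays the same role as the per-column regression here, just with a richer class of relationships.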

That being said, the goal of Sarus is to enable analysis on the original data with privacy guarantees on the result (synthetic data is merely a tool, and a fallback when there is no better solution). So you can write a statistical test to detect multicollinearity and run it on the original data within Sarus.
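As an example of such a test, a common diagnostic is the variance inflation factor (VIF): regress each attribute on all the others and compute 1 / (1 - R^2). The sketch below is plain NumPy on an in-memory array and does not use any Sarus API; the helper name `variance_inflation_factors` and the demo data are assumptions for illustration. It also assumes no column is constant.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (shape: n_samples x n_features).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from an ordinary least squares
    regression of column j on all other columns plus an intercept.
    Assumes no column is constant.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        # R^2 of exactly 1 means exact multicollinearity.
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.01, size=500)  # nearly a linear combination of a and b
d = rng.normal(size=500)                      # independent of the others

vifs = variance_inflation_factors(np.column_stack([a, b, c, d]))
print(vifs)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. In this demo the first three columns come out far above that range, while the independent column stays near 1.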