Hacker News
by maximeago 1554 days ago
We developed our own generative model for synthetic data generation. It is an autoregressive model in which each variable/attribute is generated conditionally on the previously generated ones, using Transformer networks (more details here: https://arxiv.org/pdf/2202.02145.pdf). So yes, correlations are modelled, although exact multicollinearity (an exact linear relationship among several attributes) would only be reproduced approximately in the synthetic data.
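
For readers unfamiliar with the term, "autoregressive" here refers to the standard chain-rule factorization of the joint distribution over attributes. The generic form is below; the symbols x_i (the i-th attribute) and d (the number of attributes) are notation assumed for illustration, and the linked paper may parameterize things differently.

```latex
p(x_1, \dots, x_d) = \prod_{i=1}^{d} p\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Each conditional is learned rather than exact, which is why a deterministic relation such as x_3 = x_1 + 2 x_2 tends to come out with some noise in the sampled data.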

That said, the goal of Sarus is to enable analysis on the original data with privacy guarantees on the results (synthetic data is merely a tool, and a fallback when there is no better solution). So you can write a statistical test to detect multicollinearity and run it on the original data within Sarus.
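
As a rough illustration of the kind of check meant here, below is a minimal sketch of a multicollinearity diagnostic using variance inflation factors (VIF). Strictly speaking VIF is a diagnostic rather than a formal hypothesis test. This is plain NumPy and does not use any Sarus API; the function name `variance_inflation_factors` and the rule of thumb "VIF above 10 is suspicious" are assumptions for the example, and how such code would be submitted to Sarus is not shown.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows = records, columns = attributes).

    Hypothetical helper for illustration: each column is regressed on all
    the others (with an intercept), and VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    vifs = np.empty(d)
    for j in range(d):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: intercept plus every other attribute.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # R^2 numerically equal to 1 means exact multicollinearity.
        vifs[j] = np.inf if r2 >= 1.0 - 1e-12 else 1.0 / (1.0 - r2)
    return vifs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=1000)
    b = rng.normal(size=1000)
    # c is almost a linear combination of a and b, so all three VIFs are large.
    c = a + 2.0 * b + 0.1 * rng.normal(size=1000)
    print(variance_inflation_factors(np.column_stack([a, b, c])))
```

Columns with a very large (or infinite) VIF are close to being linear combinations of the others, which is the relation that would be blurred in synthetic data.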