| HN Mirror

True? But I wouldn’t call creating new data from non-anonymous data “making data anonymous”. Instead, that’s new random data whose values are constrained or based on real-world data. I’d call that newly generated data, not anonymized data.

To me, anonymized data has an inherent risk of leaking the original transaction because it is a one-to-one mapping of the original data. If you generate new data, it will by definition diverge from the production dataset in some way that might be unrealistic. For example, fields with address components might not actually point to real places, or might not be written the same way as they would be in production. Perhaps a portion of production data includes international addresses or rural routes that your software might fail to generate, or worse, maybe it would generate them incorrectly.

Frankly, generating data is a better approach than anonymized data. And I know of anonymization techniques where good data is mixed with bad data and statistically, the bad data can be filtered out later but only in aggregate, etc. But I’m drawing a line in the sand between anonymized data that closely matches real data, and that which is “generated data”, because you can still potentially learn from the anonymized data but you can’t learn from generated data much more than you would from the initial model that created the constraints used to generate the data. I’m probably explaining this poorly, it’s a bit late at night in my time zone. :)