| HN Mirror

> This might be an easier approach than creating fake data & fake schema.

I doubt this. This is for two reasons: the first is that the development database usually shares the same schema as the production database so that's not an issue.

The second is that fake data convincingly takes care of various issues surrounding de-anonymization of data using correlations among bits of data that ostensibly have had their PII-sensitive bits removed.

If protection of user data is a priority, there are far fewer headaches associated with creating entirely fake data to populate the same schema than trying to figure out post-hoc censoring of production data.

That's not to say there aren't valid use cases of the latter. You often will want to do post-hoc censoring/aggregation if you wish to track e.g. usage metrics. This is in fact often a component of ETLs. However, those are removed from everyday development tasks.