by LukeEF
1833 days ago
That makes sense. I suppose that, using my model, you could mask some of the data in a versioned graph or in a collection in the database. That masked version can be surfaced to other users, who can then clone a collection that excludes PII. You could run the main collection and the PII-free collection in the same data product. This might be an easier approach than creating fake data and a fake schema.
I doubt this, for two reasons. The first is that the development database usually shares the same schema as the production database, so creating a schema is not an issue.
The second is that fake data convincingly takes care of the various issues surrounding de-anonymization, where correlations among fields that have ostensibly had their PII removed are used to re-identify people.
If protection of user data is a priority, there are far fewer headaches associated with creating entirely fake data to populate the same schema than trying to figure out post-hoc censoring of production data.
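To make the fake-data approach concrete, here is a minimal sketch in Python using only the standard library. The `users` table, the name lists, and the helpers `fake_user` and `build_dev_db` are all hypothetical, made up for illustration; in practice the schema would be the one production already uses.

```python
import random
import sqlite3

# Hypothetical schema, standing in for the one shared by production and development.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE,
    birth_year INTEGER NOT NULL
)
"""

# Made-up name parts; nothing here is taken from real records.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]
LAST_NAMES = ["Reed", "Stone", "Vale", "Marsh", "Quinn"]


def fake_user(rng, user_id):
    """Build one row from random parts only."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # Including the id keeps the email unique even when names repeat.
    email = f"{first}.{last}.{user_id}@example.invalid".lower()
    return (user_id, f"{first} {last}", email, rng.randint(1940, 2005))


def build_dev_db(n_rows, seed=0):
    """Create an in-memory database with the schema above and n_rows fake users."""
    rng = random.Random(seed)  # seeded so the development data set is reproducible
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        (fake_user(rng, i) for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn
```

Calling `build_dev_db(1000)` would give a development database that satisfies the same constraints as the schema while containing no row derived from a real user, so there is nothing to correlate back to anyone.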
That's not to say there aren't valid use cases for the latter. You will often want to do post-hoc censoring or aggregation if you wish to track, e.g., usage metrics. This is in fact often a component of ETLs. However, those are far removed from everyday development tasks.
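As a rough illustration of the aggregation case, the sketch below reduces raw events to per-day counts and drops the user identifier along the way. The event fields (`user_id`, `timestamp`, `action`) and the function name `daily_action_counts` are assumptions for the example, not part of any particular ETL tool.

```python
from collections import Counter


def daily_action_counts(events):
    """Count events per (day, action), discarding user identifiers."""
    counts = Counter()
    for event in events:
        # Assumes ISO 8601 timestamp strings such as "2020-05-01T09:30:00".
        day = event["timestamp"][:10]
        counts[(day, event["action"])] += 1
    return dict(counts)


events = [
    {"user_id": 1, "timestamp": "2020-05-01T09:30:00", "action": "login"},
    {"user_id": 2, "timestamp": "2020-05-01T10:00:00", "action": "login"},
    {"user_id": 1, "timestamp": "2020-05-02T08:15:00", "action": "export"},
]
metrics = daily_action_counts(events)
```

Only the aggregated `metrics` would leave the production environment; the per-user rows stay behind.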