by motohagiography 2950 days ago

De-identification of data sets (like cryptography) is a very difficult problem.

It is great that people are building tools for this. Even if I were skeptical of one or another in particular, the availability of tools popularizes the discussion of what is necessary and sufficient for de-identifying data.

The main use case I worked on was how to test an event-driven (SOA at the time) pipeline without production data. Health information handling is very tightly regulated, so generating a test data set that was large enough and reflected the needs of the system was a significant challenge. Engineers couldn't just copy some production data and use it for testing. The regime I worked in that defined these rules (early PHIPA, PIPEDA in Ontario) is not unlike what people may encounter with GDPR.

When I was doing this sort of work, I found that it made more sense to find the structure of the data, then synthesize it from scratch. For a data format like HL7, this is non-trivial.
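To make "find the structure, then synthesize from scratch" concrete, here is a rough Python sketch. It is not the tooling I actually used. The name pools, the `synth_patient` and `to_hl7_adt` helpers, and the application/facility names are all made up for illustration. It only emits a two-segment HL7 v2-style message (MSH and PID); real feeds carry many more segments, repeats, and site-specific quirks, which is where the non-trivial part lives.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools. A real generator would derive these from the
# documented structure of the feed, never from production records.
FAMILY_NAMES = ["Okafor", "Lindqvist", "Tremblay", "Nakamura"]
GIVEN_NAMES = ["Ada", "Marcus", "Priya", "Jonas"]
SEXES = ["F", "M", "U"]


def synth_patient(rng):
    """Return one fully synthetic patient as a plain dict."""
    dob = date(1940, 1, 1) + timedelta(days=rng.randrange(0, 25000))
    return {
        "mrn": f"{rng.randrange(10**7):07d}",  # 7-digit made-up record number
        "family": rng.choice(FAMILY_NAMES),
        "given": rng.choice(GIVEN_NAMES),
        "dob": dob.strftime("%Y%m%d"),
        "sex": rng.choice(SEXES),
    }


def to_hl7_adt(patient, control_id, timestamp="20180101120000"):
    """Render a patient as a minimal HL7 v2-style ADT^A01 message."""
    # MSH: field separator and encoding characters, sender, receiver,
    # timestamp, (empty) security, message type, control ID,
    # processing ID ("T" for training), version.
    msh = "|".join([
        "MSH", "^~\\&", "SYNTH_APP", "SYNTH_FAC", "TEST_APP", "TEST_FAC",
        timestamp, "", "ADT^A01", control_id, "T", "2.5",
    ])
    # PID: set ID, (empty), identifier, (empty), name, (empty), DOB, sex.
    pid = "|".join([
        "PID", "1", "", patient["mrn"], "",
        f"{patient['family']}^{patient['given']}", "",
        patient["dob"], patient["sex"],
    ])
    # HL7 v2 segments are terminated by a carriage return.
    return "\r".join([msh, pid]) + "\r"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so test runs are reproducible
    for i in range(3):
        message = to_hl7_adt(synth_patient(rng), f"MSG{i:05d}")
        print(message.replace("\r", "\n"))
```

Seeding the generator means a failing test case can be regenerated exactly, which matters once the pipeline under test has many stages.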

Synthesizing a few gigabytes of JSON/XML/text from a small training corpus yields incomplete test data. There are a few companies in the de-identification business, and I remember a few consulting services for it.

I can think of a few ways to do this, and they aren't simple.