by motohagiography
2950 days ago
De-identification of data sets (like cryptography) is a very difficult problem. It is great that people are building tools for this. Even if I were skeptical of one or another in particular, the availability of tools popularizes the discussion of what is necessary and sufficient for de-identifying data. The main use case I worked on was how to test an event-driven (SOA at the time) pipeline without production data. Health information handling is very tightly regulated, so generating a test data set large enough to reflect the needs of the system was a significant challenge. Engineers couldn't just copy some production data and use it for testing. The regime I worked in that defined these rules (early PHIPA and PIPEDA in Ontario) is not unlike what people may encounter with GDPR. When I was doing this sort of work, I found that it made more sense to find the structure of the data, then synthesize it from scratch. For a data format like HL7, this is non-trivial. Synthesizing a few gigabytes of JSON/XML/text from a small training corpus provides incomplete test data, because the output only covers the cases the corpus happened to contain. There are a few companies in the de-identification business, and I remember a few consulting services for it. I can think of a few ways to do this, and they aren't simple.
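As a rough illustration of the "find the structure, then synthesize from scratch" idea, here is a minimal Python sketch that builds HL7 v2-style admit messages from invented values only. Everything in it is an assumption made for illustration: the function names, the value pools, the `TESTAPP`/`TESTFAC` identifiers, and the choice of just an MSH and a PID segment. A realistic generator would need many more segments, plus the field-level rules and cross-message consistency that make this non-trivial.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; nothing here is derived from real patients.
FAMILY_NAMES = ["Testpatient", "Sample", "Synthetic", "Placeholder"]
GIVEN_NAMES = ["Alex", "Sam", "Jordan", "Casey"]
SEXES = ["F", "M", "U"]


def synth_patient(rng):
    """Invent one patient record from the value pools above."""
    dob = date(1930, 1, 1) + timedelta(days=rng.randrange(0, 30000))
    return {
        "mrn": f"T{rng.randrange(10**7):07d}",  # made-up record number
        "family": rng.choice(FAMILY_NAMES),
        "given": rng.choice(GIVEN_NAMES),
        "dob": dob.strftime("%Y%m%d"),
        "sex": rng.choice(SEXES),
    }


def synth_adt_a01(rng, control_id):
    """Build a two-segment, HL7 v2-style ADT^A01 message (simplified)."""
    p = synth_patient(rng)
    # MSH: header segment. "T" marks the message as training/test data.
    msh = "|".join([
        "MSH", "^~\\&", "TESTAPP", "TESTFAC", "RECVAPP", "RECVFAC",
        "20200101120000", "", "ADT^A01", f"MSG{control_id:06d}", "T", "2.5",
    ])
    # PID: patient identification segment.
    pid = "|".join([
        "PID", "1", "", f"{p['mrn']}^^^TESTFAC^MR", "",
        f"{p['family']}^{p['given']}", "", p["dob"], p["sex"],
    ])
    # HL7 v2 segments are terminated by a carriage return.
    return "\r".join([msh, pid]) + "\r"


def synth_batch(seed, count):
    """Generate a reproducible batch of messages from a seed."""
    rng = random.Random(seed)
    return [synth_adt_a01(rng, i + 1) for i in range(count)]
```

Calling `synth_batch(42, 1000)` would give a thousand messages that can be regenerated exactly from the seed, which helps when a failing test has to be reproduced. Because no production record is ever read, there is nothing to re-identify, but the data is only as realistic as the structure encoded in the generator.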