Hacker News
by crabbone 1197 days ago
Many years ago I had a conversation with an older colleague in which I was overly optimistic about some inter-database tool. He was very skeptical of the tool (and was proved right shortly afterwards), but that is less important. The more important thing was his claim about whoever creates a tool that can automatically connect different databases, in the sense that "John Smith" in one database is unambiguously linked to "Smith J." in another, which would allow, for example, different government agencies to stop burdening us, the taxpayers, with the endless rigamarole of submitting the same information over and over...

So, he claimed that whoever builds such a thing would instantly become the richest person in the world, eclipsing Bill Gates and Jeff Bezos combined.
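To make the ambiguity concrete, here is a minimal sketch of why "Smith J." cannot be linked unambiguously on the name alone. Everything in it is an assumption for illustration: the `name_key` and `candidates` helpers, the two name layouts they accept, and the sample records are all made up, and real record linkage has to deal with far messier input.

```python
def name_key(raw: str) -> tuple[str, str]:
    """Reduce a name to (surname, first initial), lowercased.

    Hypothetical helper: it only understands two layouts,
    "Given Surname" and "Surname G.".
    """
    parts = raw.replace(",", " ").replace(".", " ").split()
    if len(parts) != 2:
        raise ValueError(f"unsupported name layout: {raw!r}")
    first, second = parts
    if len(second) == 1:
        # "Surname G." layout: the single letter is the initial.
        surname, initial = first, second
    elif len(first) == 1:
        # "G. Surname" layout.
        surname, initial = second, first
    else:
        # "Given Surname" layout.
        surname, initial = second, first[0]
    return surname.lower(), initial.lower()


def candidates(query: str, records: list[str]) -> list[str]:
    """Return every record that shares the query's (surname, initial) key."""
    key = name_key(query)
    return [r for r in records if name_key(r) == key]


# Made-up records standing in for one agency's database.
agency_a = ["John Smith", "Jane Smith", "James Smith", "Mary Jones"]
matches = candidates("Smith J.", agency_a)
print(matches)
```

With these sample records the abbreviated name is compatible with three different people, so the link can only be resolved with extra information that the other database may not hold.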

Well, having worked with many different databases, I can see why that's mission impossible... So, what does this have to do with anonymization? Well, most databases in the world are either built by application developers or later extended to meet the demands of application developers, in such a way that, in all but the most trivial cases, the meaning of the stored data is impossible to determine without the application that works with the database. Not to mention that the data in the majority of databases is generated by humans, and even though both application developers and data administrators try to prevent invalid inputs, they too make mistakes.

To continue the example of DICOM files: those are typically generated by a combo of a technician operating the machine, a radiologist who reads the image, a doctor who ordered the imaging and a medical secretary who collected the patient's data upon arrival. All of these people are very busy and have very little time to spend on each patient. This often leads to a mismatch between a field's type and the data stored in it. E.g. the patient's address gets stored in the name field, the name is stored in the allergies field, and so on. Some fields are essential for the file to move around the system, but many of the properties won't prevent the file from reaching its target, even if they contain completely nonsensical data.
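As a rough illustration of how such mix-ups might be flagged, here is a sketch that runs two crude heuristics over a metadata record. It assumes the metadata has already been extracted into a plain dict keyed by DICOM attribute keywords (`PatientName`, `PatientAddress`, `Allergies`); the helper functions, the regular expressions and the sample record are all hypothetical, and heuristics this simple will both miss real problems and raise false alarms.

```python
import re


def looks_like_address(value: str) -> bool:
    # Crude heuristic: a house number followed later by a street-type word.
    pattern = r"\b\d+\s+\w+.*\b(street|st|avenue|ave|road|rd)\b"
    return bool(re.search(pattern, value, re.IGNORECASE))


def looks_like_person_name(value: str) -> bool:
    # DICOM person names separate components with "^", e.g. "Smith^John".
    pattern = r"[A-Za-z'\-]+\^[A-Za-z'\-]+(\^[A-Za-z'\-]*)*"
    return bool(re.fullmatch(pattern, value.strip()))


def suspicious_fields(record: dict[str, str]) -> list[str]:
    """Return the keys whose content looks like it belongs in another field."""
    flagged = []
    if looks_like_address(record.get("PatientName", "")):
        flagged.append("PatientName")
    if looks_like_person_name(record.get("Allergies", "")):
        flagged.append("Allergies")
    return flagged


# Made-up record reproducing the mix-up described above.
record = {
    "PatientName": "12 Elm Street",
    "PatientAddress": "",
    "Allergies": "Smith^John",
}
flags = suspicious_fields(record)
print(flags)
```

Note what this means for anonymization: a tool that blanks only the name field would leave the real name sitting in the allergies field, which is why flagged records still need a person to look at them.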

----

My wife participated in some Kaggle challenges that had to do with chest CT. To do that, she went through some of the publicly available image sets in this general category. Each contained defective images, up to and including CTs of other body parts, X-rays and so on. (Needless to say, metadata such as the proper radiological modality had been wiped from the sets, so there was no contrast information attached to the images, etc.) And that was only what she could find with some simple scripts that relied on heuristics.

What I'm trying to say is that automatically processing large quantities of data acquired in real-world situations will almost certainly not live up to expectations. It will require a human in the loop until we have AI comparable to human intelligence.