Hacker News
by fluidcruft 1550 days ago
I think this is interesting, but I'm having trouble seeing how it would apply to the sorts of machine learning tasks that are drawing heavy interest in a radiology department. How does it apply to, say, development or testing of image segmentation tools? Quite often vendors want to sell us software, and we would very much like to test it at scale on our own data to see whether it's trash or not, because procurement is a beast. Does this sort of tool provide that sort of interface somehow? I can see how it works for tabular data; I'm just not sure how you can guarantee PHI is fuzzed sufficiently in images.

Here is how it would work in theory (leaving aside the scalability question of working with heavy DICOM files and huge DNNs). I'm assuming your data is made of records, each composed of an image and some information about the image or the patient.

The system will generate a fake dataset with the exact same structure and schema (the information on patients is realistic; the images look reasonable and, importantly, have the right encoding, size, etc.). The purpose of this fake data is to let the vendor adjust their algorithm so it can consume your data as it is. The vendor builds the preprocessing on the fake data and then submits their data job to the API (say, a preprocessing function to be applied to each record and a TensorFlow model to be fitted on the data, or just evaluated to measure its performance on the data). The preprocessing code runs on the original records, and the model is trained or validated against the real data. In the end they can prove the value of their model without having to get their hands on the real data.
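To make that flow concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and only for illustration: the `Record` schema, `make_fake_records`, `run_job`, and the two vendor functions are invented names, not the actual API, and a mean pixel intensity stands in for a real model metric. Returning only an aggregate is also not a privacy guarantee by itself; it just shows where the boundary between the vendor and the real data sits.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    # Hypothetical schema: one metadata field plus a tiny grayscale "image".
    patient_age: int
    pixels: List[List[float]]


def make_fake_records(n: int, height: int, width: int, seed: int = 0) -> List[Record]:
    """Generate fake records with the same structure as the real ones."""
    rng = random.Random(seed)
    return [
        Record(
            patient_age=rng.randint(18, 90),
            pixels=[[float(rng.randint(0, 255)) for _ in range(width)]
                    for _ in range(height)],
        )
        for _ in range(n)
    ]


def run_job(records: List[Record],
            preprocess: Callable[[Record], Record],
            score: Callable[[Record], float]) -> Dict[str, float]:
    """Run vendor code inside the data owner's environment.

    Per-record results stay local; only an aggregate is returned.
    """
    scores = [score(preprocess(r)) for r in records]
    return {"n_records": len(scores), "mean_score": sum(scores) / len(scores)}


# Vendor side: written and debugged against the fake data only.
def vendor_preprocess(record: Record) -> Record:
    # Scale 8-bit pixel values into [0, 1].
    scaled = [[p / 255.0 for p in row] for row in record.pixels]
    return Record(patient_age=record.patient_age, pixels=scaled)


def vendor_score(record: Record) -> float:
    # Stand-in for a model metric: mean pixel intensity.
    flat = [p for row in record.pixels for p in row]
    return sum(flat) / len(flat)


fake = make_fake_records(n=3, height=4, width=5)
result = run_job(fake, vendor_preprocess, vendor_score)
print(result)
```

The vendor would iterate on `vendor_preprocess` and `vendor_score` against `fake`, then hand the same two functions to the data owner, who calls `run_job` on the real records.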

The problem we generally have is that plugging the vendor's [insert TensorFlow model component] into our network seems to always become an operational no-go prior to purchase, for a variety of reasons including intrusiveness, questions about privacy, and the vendor's ability to manipulate the process to get access to datasets. So it's actually the preprocessing step that we keep hitting as the pain point. In some cases we generate de-identified datasets for demonstration and testing, but that can be very labor intensive.

I've not encountered differential privacy in my work before now, but at least for dealing with metadata in the DICOM it could probably be helpful for some datasets. But it could still be challenging to ensure the IODs are correct (or that known quirks are preserved). Anyway this is very interesting. I have a colleague who is working on some utilization/value research using billing records and I'll show him this.

Thanks! Our goal is that no matter what preprocessing function they pass, they only end up accessing outputs that comply with the privacy policies. The code gets access to the real data, but the data is shielded from the vendor, who can only see protected outputs. It should address the risk of private information being exposed to them, but for sure, the more sophisticated the preprocessing code is, the more challenging this becomes. Deep learning on DICOM data is pushing the system to the edge a bit.
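As a rough illustration of what a "protected output" could look like, here is a sketch of a differentially private mean using the standard Laplace mechanism. This is an assumption about one possible protection, not a description of how the system actually enforces its policies. `dp_mean` and its parameters are hypothetical names; the per-record values are clipped to `[lo, hi]`, and the sensitivity `(hi - lo) / n` assumes neighboring datasets differ by replacing one record.

```python
import random
from typing import Optional, Sequence


def dp_mean(values: Sequence[float], lo: float, hi: float,
            epsilon: float, rng: Optional[random.Random] = None) -> float:
    """Mean of clipped values plus Laplace noise (epsilon-DP under the
    replace-one-record assumption stated above)."""
    if not values:
        raise ValueError("values must be non-empty")
    if epsilon <= 0 or hi <= lo:
        raise ValueError("need epsilon > 0 and hi > lo")
    rng = rng or random.Random()

    # Clipping bounds each record's influence on the mean.
    clipped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clipped) / len(clipped)

    # Laplace scale = sensitivity / epsilon.
    scale = (hi - lo) / (len(clipped) * epsilon)

    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_mean + noise


# Per-record scores computed on the real data never leave the system;
# the vendor would only see the noisy aggregate.
per_record_scores = [0.81, 0.77, 0.92, 0.64, 0.88]
print(dp_mean(per_record_scores, lo=0.0, hi=1.0, epsilon=1.0))
```

A smaller `epsilon` adds more noise, so with few records the released number can be far from the true mean. That is part of why complex per-image pipelines are harder to protect than simple aggregates over tabular data.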