Hacker News
by maximeago 1550 days ago
Sarus is designed for all data use cases, provided that access to a given user's information is not the objective. This is the case for all of BI, analytics, or machine learning. It also works for testing or debugging, building APIs, etc. It resonates with organizations' aspiration for the democratization of data.

Differential privacy provides much better protection than data masking, but most importantly, it does not require any manual decisions (which column to mask, how, etc.). This is what makes it easy to apply at scale to all datasets in the data warehouse or data lake, instead of making those decisions dataset by dataset.
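
To make the "no per-column decision" point concrete, here is a minimal sketch of the textbook Laplace mechanism for a counting query. This is the generic mechanism from the differential privacy literature, not a description of how Sarus implements it; `dp_count` and its parameters are hypothetical names chosen for illustration.

```python
import random


def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is a Laplace(0, scale) sample.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random(0)
    patients = [{"age": age} for age in range(100)]
    # No column is masked by hand: the noise is calibrated to the query.
    print(dp_count(patients, lambda r: r["age"] >= 50, epsilon=1.0, rng=rng))
```

The noise depends only on the query's sensitivity and on epsilon, not on which columns someone judged to be sensitive, which is why nothing has to be decided dataset by dataset. A real system would also have to track the privacy budget spent across queries, which this sketch leaves out.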

Differential privacy is used by Apple, Google, Microsoft, and the US Census Bureau. When used properly, the data protection it provides does not need to be proven to regulators or security teams anymore. That being said, regulators do not require DP protection per se. They require organizations to put in place best practices in terms of data governance, data minimization, and data security as a whole. Differential privacy is part of the answer.

1 comments

I think this is interesting but I'm having trouble seeing how it would apply to the sorts of machine learning tasks that are drawing heavy interest in a radiology department. How does it apply to, say, development or testing of image segmentation tools? Quite often vendors want to sell us software and we would very much like to test it at scale on our own data to see whether it's trash or not because procurement is a beast. Does this sort of tool provide that sort of an interface somehow? I can see how it works for tabular data, I'm just not sure how you can guarantee PHI is fuzzed sufficiently in images.

Here is how it would work in theory (not including the scalability question of working with heavy DICOM files and huge DNNs). I'm assuming your data is made of records composed of an image and some information about the image or the patient.

The system will generate a fake dataset with the exact same structure and schema (the information on patients is realistic, the images look reasonable and, importantly, have the right encoding, size, etc.). The purpose of this fake data is to let the vendor adjust their algorithm so that it can consume your data as it is. The vendor builds the preprocessing on the fake data and then submits their data job to the API (say a preprocessing function to be applied to each record and a TensorFlow model to be fitted on the data, or just evaluated to measure its performance on the data). The preprocessing code runs on the original records, and the model is trained or validated against the real data. In the end they can prove the value of their model without having to get their hands on the real data.
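
As a rough illustration of that workflow, here is a toy sketch in plain Python. Every name in it (`Record`, `synthetic_like`, `run_job`) is hypothetical and made up for this example; it is not the actual API. It also assumes that each per-record score is clamped to [0, 1] and that the number of records is public, so that Laplace noise of scale 1/(n * epsilon) is enough to protect the mean.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    pixels: List[int]  # stand-in for an image, flattened 8-bit values
    age: int           # stand-in for patient metadata


def synthetic_like(records, rng):
    """Fake records with the same schema and image size, random contents."""
    size = len(records[0].pixels)
    return [
        Record([rng.randrange(256) for _ in range(size)], rng.randrange(18, 90))
        for _ in records
    ]


def run_job(real_records, preprocess: Callable, score: Callable, epsilon, rng):
    """Run vendor code on the real records and release only a noisy mean."""
    # Clamp each per-record score to [0, 1] so that one record can move
    # the mean by at most 1/n, whatever the vendor code returns.
    scores = [min(1.0, max(0.0, score(preprocess(r)))) for r in real_records]
    mean = sum(scores) / len(scores)
    scale = 1.0 / (epsilon * len(scores))
    # Difference of two exponential draws is a Laplace(0, scale) sample.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return mean + noise


if __name__ == "__main__":
    rng = random.Random(0)
    real = [Record([i % 256] * 16, 30 + i % 40) for i in range(200)]
    fake = synthetic_like(real, rng)

    # The vendor develops against the fake records only.
    def preprocess(r):
        return [p / 255 for p in r.pixels]

    def score(x):
        return sum(x) / len(x)

    score(preprocess(fake[0]))  # runs locally, no real data involved
    # The same functions are then submitted and run on the real records.
    print(run_job(real, preprocess, score, epsilon=1.0, rng=rng))
```

The sketch only covers the shape of the exchange: the vendor sees `fake` and the noisy number, never `real`. It does not sandbox the submitted functions, so a real system would additionally have to prevent them from leaking records through side channels.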

The problem we generally have is that plugging the vendor's [insert tensorflow model component] into our network seems to always become an operational no-go prior to purchase, for a variety of reasons including intrusiveness and questions about privacy and the vendor's ability to manipulate the process to get access to datasets. So it's actually the preprocessing step that we keep hitting as the pain point. In some cases we generate de-identified datasets for demonstration and testing but it can be very labor intensive.

I've not encountered differential privacy in my work before now, but at least for dealing with metadata in the DICOM it could probably be helpful for some datasets. But it could still be challenging to ensure the IODs are correct (or that known quirks are preserved). Anyway this is very interesting. I have a colleague who is working on some utilization/value research using billing records and I'll show him this.

Thanks! Our goal is that no matter what preprocessing function they pass, they only end up accessing outputs that comply with the privacy policies. The code gets access to the real data but it is shielded from the vendor, who can only see protected outputs. It should address the risk of private information being exposed to them, but for sure, the more sophisticated the preprocessing code is, the more challenging this becomes. Deep learning on DICOM data is pushing the system to the edge a bit.