|
I do machine learning work in healthcare at a HIPAA covered entity. Permissions and data access rules often get applied to data scientists in an unnecessarily strict fashion in these environments, usually because engineering managers brought in from non-regulated environments don't understand them (e.g., hiring a Salesforce engineering manager into a healthcare system so they can "disrupt" or "solve" healthcare: they hear "regulation" and immediately clamp down on everything). HIPAA allows use of clinical data for treatment, payment, and operations. You can also get around consent issues if the data is properly deidentified. If you have a data scientist who is working to further treatment, payment, or operations (i.e., isn't working on purely marketing uses, selling the data, or doing "real" research), then they are allowed to use the data, assuming it's the minimum necessary for their job. For training machine learning models that support operations, "minimum necessary" is probably a lot of data. And, obviously, the production pipelines and the training/experimentation/development environments need access to the same amount of data if you want to train and deploy models. Data scientists are also likely to be the first to notice problems with how your product is working, often before the data engineering team. At my company, I've found numerous bugs in our data engineering pipelines and production code because I saw anomalies in the data and went digging through the data warehouse, replicas of the production databases, and our actual product. You probably want to support and encourage that kind of sleuthing, but each organization is different, so maybe you have better QA that's more attuned to data issues. My opinion, from having done this for over a decade, is that the question shouldn't be about how much access you give your data scientists.
They should have access to nearly all of the data that's within their domain, assuming they're legally entitled to it. The question you should be solving for is what restrictions to place on how they access and process that data: e.g., make EC2 instances and centralized Jupyter notebooks available for them to download and process data, and prohibit storing data on a laptop. |
It's very difficult to build rules and policies that allow broad access while maintaining minimum necessary. A project may be completely justified in accessing "all" (waves hands) data at its conception, then slowly morph to focus on only a few key identifiers while still processing "all" data.
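
A rough sketch of one way to spot that kind of drift, assuming you can log which columns each project's queries actually touch. Everything here is hypothetical: `ProjectAccess`, the project name, and the column names are made up for illustration, and comparing granted columns against used columns is only a review aid, not a compliance mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectAccess:
    """Hypothetical record of what a project was granted vs. what it reads."""

    name: str
    granted_columns: set[str]
    used_columns: set[str] = field(default_factory=set)

    def record_query(self, columns: list[str]) -> None:
        # Track every column the project's queries actually touch.
        self.used_columns.update(columns)

    def unused_grants(self) -> set[str]:
        # Granted but never read: candidates to revoke at the next review.
        return self.granted_columns - self.used_columns

    def out_of_scope(self) -> set[str]:
        # Read but never granted: should always be empty.
        return self.used_columns - self.granted_columns


# Assumed example: the project started with a broad grant...
project = ProjectAccess(
    name="readmission_model",
    granted_columns={"patient_id", "dob", "zip", "diagnosis_code", "claim_amount"},
)

# ...but its queries have narrowed to a few fields over time.
project.record_query(["patient_id", "diagnosis_code"])
project.record_query(["patient_id", "claim_amount"])

print(sorted(project.unused_grants()))  # ['dob', 'zip']
```

Running something like this on a schedule gives reviewers a concrete list to question, which is easier than asking a team to re-justify "all" data from scratch.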