by screye 1850 days ago
A devil's advocate:

Giving access to a nation's healthcare data for statistical and ML uses could speed up the development of ML in diagnostics by a huge amount.

Once you scrub the source data to remove birth dates, report creation dates & zip codes, it should be sufficiently anonymized that it can't be traced back to the individual. We can enable some level of differential privacy on top as well.
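To make the two steps concrete, here is a minimal sketch, not a vetted privacy pipeline. The field names (`birth_date`, `report_created`, `zip_code`, `diagnosis`) and the choice of epsilon are assumptions for illustration only. It drops the named fields, then answers a counting query with Laplace noise, which is the textbook mechanism for a differentially private count.

```python
import random

# Hypothetical field names; the comment only says birth dates,
# report creation dates and zip codes get removed.
QUASI_IDENTIFIERS = {"birth_date", "report_created", "zip_code"}


def scrub(record):
    """Return a copy of the record without the quasi-identifier fields."""
    return {k: v for k, v in record.items() if k not in QUASI_IDENTIFIERS}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_count(records, predicate, epsilon, rng=None):
    """Noisy count of matching records.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up records, purely for illustration.
raw = [
    {"birth_date": "1980-01-01", "report_created": "2021-05-01",
     "zip_code": "SW1A", "diagnosis": "flu"},
    {"birth_date": "1975-03-02", "report_created": "2021-05-02",
     "zip_code": "E1", "diagnosis": "asthma"},
]
cleaned = [scrub(r) for r in raw]
print(dp_count(cleaned, lambda r: r["diagnosis"] == "flu", epsilon=0.5))
```

Note that dropping three fields does not by itself guarantee anything about re-identification from the remaining columns; only the noisy query carries a formal guarantee, and only for that query.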

ML's 2 big leaps of the last decade, 2012 CNNs and 2017-18 pre-trained transformers, both came off the back of a leap in data availability (ImageNet for CNNs and scraping the entire internet for BERT).

Individual hospitals and the startups they bankroll have their in-house ML teams, but closed data and an unwillingness to disseminate it have made the field move at a snail's pace. Additionally, generalizability of any kind won't be achieved until the data gets scaled up past small geographic pockets and patient sets. This is especially true in medicine, which has a long-tail problem. Lastly, aggregating data together lends a natural anonymity to each user whose data is shared within the dataset.

IMO, disease diagnostics is one of the most ideal castings for a problem in ML: a purely technical trade where data and decisions have a degree of exactness and concepts like conditional probability are a natural fit. The only problem is that the pipeline is still largely analog. This means that the data collected about the doctor's diagnostic processes still comes out incomplete, and privacy protections make sure it stays on a scale small enough to make ML difficult.
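As a small illustration of why conditional probability fits diagnostics so naturally, here is Bayes' rule applied to a single positive test. The prevalence, sensitivity and specificity values are invented for the example and do not come from any real test.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Total probability of a positive result across both groups.
    p_pos = (p_pos_given_disease * prevalence
             + p_pos_given_healthy * (1.0 - prevalence))
    return p_pos_given_disease * prevalence / p_pos


# Illustrative numbers only: 1% prevalence, 95% sensitivity, 95% specificity.
print(round(posterior_given_positive(0.01, 0.95, 0.95), 3))  # 0.161
```

Even with a fairly accurate test, a positive result for a rare condition leaves the probability of disease at about 16% here, which is one reason the long tail in medicine is hard.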

2 comments

This data will not be anonymized enough IMO. FTA:

"Data that directly identifies patients will be replaced with unique codes in the new data set, but the NHS will hold the keys to unlock the codes “in certain circumstances, and where there is a valid legal reason”, according to its website."

This makes sense from a care optimization perspective though, yeah? Say an ML model is developed to predict a rare disease, and person 12345 scores highly by that model but has not been tested. Without IDs, the NHS would need to replicate the model on secure infrastructure to identify and potentially save the life of the person. With IDs, researchers most familiar with the methodology who already have the infrastructure set up can simply give a set of high-scoring IDs to NHS administrators.
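A rough sketch of that workflow, assuming the "unique codes" are keyed pseudonyms. The article does not say how the NHS derives its codes, so the HMAC scheme, the function names and the `marker` field below are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(patient_id, key):
    """Derive a stable code from a patient ID; it cannot be recomputed without the key."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def build_release(records, key):
    """Return (dataset with codes instead of IDs, lookup table kept by the key holder)."""
    released, lookup = [], {}
    for rec in records:
        code = pseudonymize(rec["patient_id"], key)
        lookup[code] = rec["patient_id"]
        row = {k: v for k, v in rec.items() if k != "patient_id"}
        row["code"] = code
        released.append(row)
    return released, lookup


def flag_high_risk(released, score, threshold):
    """Researcher side: only codes are visible, so only codes are handed back."""
    return [row["code"] for row in released if score(row) >= threshold]


def reidentify(codes, lookup):
    """Key-holder side: map flagged codes back to real patient IDs."""
    return [lookup[c] for c in codes]


key = b"held-only-by-the-data-controller"  # placeholder, not a real key
records = [
    {"patient_id": "12345", "marker": 0.91},
    {"patient_id": "67890", "marker": 0.12},
]
released, lookup = build_release(records, key)
flagged = flag_high_risk(released, lambda row: row["marker"], threshold=0.9)
print(reidentify(flagged, lookup))  # ['12345']
```

The researchers never see `patient_id`; they return codes, and whoever holds the key and lookup table decides whether to resolve them.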
"Without IDs, the NHS would need to replicate the model on secure infrastructure to identify and potentially save the life of the person." Why is this an issue?
To counter your devil's advocate, universities within the UK that perform the most cutting-edge research in this specific field are already closely linked to specialist hospitals, and can obtain the kind of data they need on a targeted and explicitly consensual basis. The fallacy that more data inherently leads to better ML affords poorer-quality research, and explainability often lags behind in these cases.

It's hard to compare NLP (such as pretrained transformer models) to medical ML because there are real and potentially fatal implications to misdiagnosis. The focus should be on small-scale and explainable ML, not brute-forcing patterns across large populations (which is more effective for insurance companies than clinicians). FWIW, I'm a massive fan of the potential of CV in diagnosis, and in helping to spot abnormalities early, but I think the proposed opening up of data is absolutely the wrong way to see innovation in this field.