| A devil's advocate: Giving access to a nation's healthcare data for statistical and ML uses could speed up the development of ML in diagnostics by a huge amount. Once you scrub the source data to remove birth dates, report creation dates & zip codes, it should be sufficiently anonymized that it can't be traced back to the individual. We can enable some level of differential privacy on top as well. ML's 2 big leaps of the last decade, CNNs in 2012 and pre-trained transformers in 2017-18, both came off the back of a leap in data availability (ImageNet for CNNs and scraping the entire internet for BERT). Individual hospitals and the startups they bankroll have their in-house ML teams, but closed data and an unwillingness to share it have made the field move at a snail's pace. Additionally, generalizability of any kind won't be achieved until the data gets scaled up past small geographic pockets and patient sets. This is especially true in medicine, which has a long-tail problem. Lastly, aggregating data lends a natural anonymity to each user whose data is shared within the dataset. IMO, disease diagnostics is one of the most ideal castings for a problem in ML: a purely technical trade where data and decisions have a degree of exactness and concepts like conditional probability are a natural fit. The only problem is that the pipeline is still largely analog. This means that the data collected about doctors' diagnostic processes still comes out incomplete, and privacy protections make sure it stays at a scale small enough to make ML difficult. |
"Data that directly identifies patients will be replaced with unique codes in the new data set, but the NHS will hold the keys to unlock the codes “in certain circumstances, and where there is a valid legal reason”, according to its website."
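
For anyone wondering what "replaced with unique codes" plus a separately held key might look like in practice, here is a rough sketch. It is not the NHS's actual scheme; the field names (`patient_id`, `birth_date`, `report_date`, `zip_code`) and the `pseudonymize` helper are made up for illustration. It drops the quasi-identifiers mentioned above and swaps the direct identifier for a random code, while keeping a lookup table that can reverse the mapping.

```python
import secrets

# Hypothetical field names, not taken from any real NHS schema.
QUASI_IDENTIFIERS = {"birth_date", "report_date", "zip_code"}
DIRECT_IDENTIFIER = "patient_id"


def pseudonymize(records):
    """Return (scrubbed_records, key_table).

    key_table maps each random code back to the original identifier.
    In a real deployment it would be stored separately from the
    scrubbed data, by whoever is allowed to "unlock" the codes.
    """
    code_for = {}   # original id -> random code
    key_table = {}  # random code -> original id
    scrubbed = []
    for record in records:
        original_id = record[DIRECT_IDENTIFIER]
        if original_id not in code_for:
            code = secrets.token_hex(8)  # random, carries no patient info
            code_for[original_id] = code
            key_table[code] = original_id
        # Keep everything except the direct and quasi-identifiers.
        clean = {
            field: value
            for field, value in record.items()
            if field not in QUASI_IDENTIFIERS and field != DIRECT_IDENTIFIER
        }
        clean["patient_code"] = code_for[original_id]
        scrubbed.append(clean)
    return scrubbed, key_table


records = [
    {"patient_id": "A1", "birth_date": "1980-01-01", "zip_code": "12345", "diagnosis": "flu"},
    {"patient_id": "A1", "report_date": "2021-05-01", "diagnosis": "asthma"},
    {"patient_id": "B2", "diagnosis": "flu"},
]
scrubbed, key_table = pseudonymize(records)
```

Because the key table exists, this is pseudonymization rather than anonymization: anyone holding the table can map a code back to a patient, which is exactly the "in certain circumstances" unlock the quote describes.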