| HN Mirror

	by 1e-9 2611 days ago
	I would say your criticism is way off base. I've developed and fielded ML-based medical devices, and this looks like a reasonable study that suggests they have a system worthy of further testing. There's nothing wrong with using an ROC curve here. They document the experience of the doctors, so they weren't hiding that, and roughly 60 of the doctors had more than 5 years of experience. Also, studies like this generally don't use only biopsy-proven negatives, since that would bias the negatives towards those that were suspicious enough to biopsy. Without knowing more details than what the paper provides, I cannot say the results are valid, but I also don't see any terrible errors after a quick scan.

The main weakness is probably the fact that the test set came from the same image archive used for development. As a result, there can be all sorts of biases the CNN is using to inflate its performance, unbeknownst to the developers. The best way to eliminate that concern is to use a test set gathered through a different data collection effort at different clinics, but that is expensive and time-consuming and not something I would do initially. This looks like a good first step and I would encourage the developers to carry it further.

EDIT: I'll add that the ratio of positives to negatives in the training set is irrelevant and in no way invalidates the study. As far as testing goes, there is always a balance you must strike in a reader study involving doctors. Ideally, you would have the exact ratio a doctor would encounter in practice, but for a screening study that is typically impractical, as you would need a huge number of cases and doctor time is expensive. A ratio of 1 positive to 4 negatives is entirely reasonable. The doctors (particularly the less experienced ones) will almost certainly show elevated sensitivity and reduced specificity, since they will know it is an enriched set, but this is acceptable for ROC comparison purposes because it mostly just selects a different point on each doctor's personal ROC curve. Note that some studies even tell the doctors beforehand what percentage of cases are positive.
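
To make the enriched-set point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration and none of it comes from the study: the reader is modeled with a hypothetical binormal score distribution (`NEG`, `POS`), and the thresholds and the 0.5% screening prevalence are made up. Only the 1:4 ratio (prevalence 0.2) is taken from the comment. The idea is that sensitivity and specificity depend on the reader's threshold, not on the case mix, so a reader who gets more aggressive on an enriched set moves along the same ROC curve, while a prevalence-dependent number such as PPV does change.

```python
from statistics import NormalDist

# Hypothetical binormal reader model (an assumption, not from the study):
# suspicion scores for negatives ~ N(0, 1), for positives ~ N(1.5, 1).
NEG = NormalDist(0.0, 1.0)
POS = NormalDist(1.5, 1.0)


def operating_point(threshold):
    """Return (sensitivity, specificity) when a case is called positive if score > threshold.

    Prevalence is deliberately not an argument: both rates are conditioned
    on the true class, so the positive/negative ratio cannot affect them.
    """
    sensitivity = 1.0 - POS.cdf(threshold)
    specificity = NEG.cdf(threshold)
    return sensitivity, specificity


def ppv(threshold, prevalence):
    """Positive predictive value, which does depend on prevalence."""
    sensitivity, specificity = operating_point(threshold)
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Made-up thresholds: a cautious reader vs. the same reader on an enriched set.
    for label, threshold in [("normal practice", 1.5), ("enriched set", 0.75)]:
        sens, spec = operating_point(threshold)
        print(f"{label}: sensitivity={sens:.3f} specificity={spec:.3f}")

    # Same threshold, different case mix: 1:4 enriched vs. a made-up 0.5% prevalence.
    for prevalence in (0.2, 0.005):
        print(f"prevalence={prevalence}: PPV={ppv(1.0, prevalence):.3f}")
```

Both threshold settings are points on one ROC curve, which is why the shift in reader behavior does not by itself invalidate an ROC comparison against the CNN. It would matter if the study reported PPV or raw accuracy instead.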