In Thinking, Fast and Slow, the author details a double-blind trial where they did this. Humans working with AI did worse than AI alone. Humans think they can use AI as a guide and move it in the right direction. But the adjustments they made were, on average, bad.
Surely in this type of instance (looking at a scan to answer a yes/no question) the human and AI act independently, with the computer being a useful aid because it separately picks up a few of the human's false negatives. Assuming false negatives are a lot worse than false positives, this can only be a good thing.
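To put rough numbers on that, here is a minimal sketch of the "flag it if either reader flags it" rule. It assumes the human's and the AI's errors really are independent, and the rates are made up purely for illustration; `combine_or` is a hypothetical helper, not anything from a real screening system.

```python
def combine_or(sens_human, spec_human, sens_ai, spec_ai):
    """Flag a scan if EITHER reader flags it, assuming independent errors."""
    # A cancer is missed only if both readers miss it.
    sensitivity = 1 - (1 - sens_human) * (1 - sens_ai)
    # A healthy scan is cleared only if both readers clear it.
    specificity = spec_human * spec_ai
    return sensitivity, specificity


# Hypothetical rates, chosen only for illustration.
sens, spec = combine_or(sens_human=0.80, spec_human=0.90,
                        sens_ai=0.70, spec_ai=0.90)
print(f"combined sensitivity: {sens:.2f}")
print(f"combined specificity: {spec:.2f}")
```

Under these assumptions the combined sensitivity is higher than either reader's alone, but the combined specificity is lower than either's, so the extra catches are paid for with extra false positives.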
If they lead to an unnecessary mastectomy then false positives are pretty bad. Not as bad as dying, obviously, but still a severe blow to a woman's identity and sense of self-worth.
It's going to be a hard pill to swallow if you have to tell a woman "sorry, we removed your healthy breast because the computer made a mistake."
I think the idea of "screening" is that you don't just race off to a mastectomy the minute some AI model goes off. Of course, putting more false positives through a fallible process of review does run the risk you speak of.
It sounds like a smart hospital would run a patient through both human and AI screenings separately, and have a different doctor examine both results and evaluate the discrepancies. This way you would keep the strengths of both approaches, lowering the failure rates. Depending on the country's health care funding, it can also be good business from the hospital's POV, as they get to charge for the extra work and the better success rates drive business.
And I wonder what happens if you apply machine learning to the differences between AI and human screening results.
Radiologists are really bad at detection, even after many years of study. That's quite often due to the coarse level of detail in scans, where only large tumors can be observed or recognized with any certainty. Surpassing humans there is not so difficult, but improving accuracy from e.g. 32% to 34% doesn't really sound like a win :(
> 32% to 34% doesn't really sound like a win :(
We are talking about human lives here, not about beating some CPU benchmark. Improving detection by 2 percentage points is huge for almost any disease.
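A quick back-of-the-envelope sketch of what the quoted 32% to 34% would mean in absolute terms. The case count is a hypothetical figure picked for round numbers, and both helper functions are illustrative only.

```python
def extra_detections(cases, old_rate, new_rate):
    """Additional cases caught when the detection rate rises."""
    return round(cases * (new_rate - old_rate))


def relative_gain(old_rate, new_rate):
    """Improvement relative to the old detection rate."""
    return (new_rate - old_rate) / old_rate


CASES = 10_000  # hypothetical number of patients who actually have the disease

print(extra_detections(CASES, 0.32, 0.34), "more cases caught")
print(f"relative improvement: {relative_gain(0.32, 0.34):.1%}")
```

So 2 percentage points is a bit over 6% in relative terms, and under this assumed case count it means a couple of hundred people diagnosed who would otherwise have been missed.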
remember when ensembles were the cool word before they got erased from collective consciousness and replaced with deep things? it can't even be a decade, was it 2012 or something?
They haven't been erased, more like subsumed? If you use dropout to train your model, that is basically equivalent to using an ensemble of deep neural networks.
If you train an ensemble of models with random dropout, you have an ensemble. Models trained with dropout will still have significant variation from run to run.
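For what it's worth, here is a minimal sketch of the sense in which dropout behaves like an ensemble. It assumes a single linear unit with inverted dropout scaling, enumerates every possible dropout mask, and averages the resulting "sub-networks". The weights and inputs are hypothetical. For a linear unit this average matches the full network exactly; with nonlinearities it is only an approximation, and in every case the sub-networks share weights, which is not the same thing as averaging separately trained models.

```python
from itertools import product


def subnetwork_output(weights, inputs, mask, keep_prob):
    """Output of one linear unit with some inputs dropped (inverted dropout)."""
    return sum(m * w * x for m, w, x in zip(mask, weights, inputs)) / keep_prob


def implicit_ensemble_mean(weights, inputs, keep_prob):
    """Probability-weighted average over every possible dropout mask."""
    total = 0.0
    for mask in product((0, 1), repeat=len(weights)):
        kept = sum(mask)
        prob = keep_prob ** kept * (1 - keep_prob) ** (len(mask) - kept)
        total += prob * subnetwork_output(weights, inputs, mask, keep_prob)
    return total


# Hypothetical weights and inputs, for illustration only.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]

full = sum(w * x for w, x in zip(weights, inputs))
mean = implicit_ensemble_mean(weights, inputs, keep_prob=0.5)
print("full network:", full)
print("average over all dropout masks:", mean)
```

The two printed values agree up to floating-point error, which is the "subsumed" view. It says nothing about run-to-run variation between independently trained models, so an explicit ensemble can still add something on top.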
No, the point very much is to eventually replace doctors. You just can't easily get there before first going through a doctor-machine cooperation period.
Automation is a friend of society, but is not a friend of individuals working particular jobs. I think doctors are acutely aware of that.