by jogundas 2207 days ago
Full disclosure: I am a cofounder at a startup automating chest X-ray reporting.

It is true that ML algorithms are almost always trained on radiologist labels on the same modality, and thus inherit the readers' biases. I also agree that some radiologists are better than others, as you imply.

As a patient, one does not know who will read their film. IMHO we as an industry should not aim at beating 99.999% of radiologists. We should merely make products that consistently perform no worse than an average radiologist at a particular institution. It is always thrilling to outperform humans with your software, but in the end patient outcomes are what matter. Those are about consistent performance over a long period of time.

Demonstrating this consistent performance is the challenging part, but it is possible to prove it through sufficiently careful and lengthy prospective trials. That’s what we are focusing on, and I would love to see the other players in the industry do the same.

2 comments

Is the ML result considered a first or second opinion, or just 'a quick check' for reference only?

I believe that ML should not be taken in lieu of human opinion. The consensus, be it medical or legal, has to be explicitly human with all the responsibility attached.

Shifting the responsibility for the misses onto a faceless ML is only eroding trust in the professional opinions and cementing the biases.

I fully agree that treatment should be prescribed by a human doctor who can explain and answer questions. However, I would not agree that each of the data inputs to the treatment decision should be generated manually. That is already not the case.

This is the same argument that is often put forward in relation to accident rates for self-driving vehicles. ML only needs to outperform humans.

The problem with this argument is that it glosses over the fact that in the tail, where the ML is making a wrong decision, sometimes a catastrophic one, the behaviour of the ML algorithm is not well understood. How can we deploy something that we do not fully understand in such safety-critical applications?

Excellent point. That is what the lengthy prospective study phase as well as periodic auditing after deployment are for.

Please also note that there are several important differences compared with the automotive industry. First, one could argue that the task at hand is trivial compared with that of a self-driving car. We are operating in a heavily constrained setting with much better understood data inputs and a hundred-year history of medical professionals trying to classify and systematize them. Moreover, our task is not time-critical. It sometimes takes more than a week for such an image to be reported on.