There are more than two options here. Self-diagnosis was already difficult for doctors to deal with; now we have a machine that outputs recommendations, and does so with confidence whether it’s correct or not.
The same issues that were present with search-engine self-diagnosis are still present with LLMs. If you provide Google with an incomplete list of symptoms and can’t interpret the information you find correctly, you will likely get an incorrect diagnosis. The same is true for LLM output.
There are quite a few disclaimers everywhere that soften confidence: "always ask a medical specialist", "I'm not a doctor", "this could have been this or that but really not sure", etc.
It's a nightmare because users approach LLMs with the false confidence that they're always right, and present LLM outputs as fact to doctors, who have to waste time explaining that it's wrong most of the time. It hurts more than it helps.
It's a nightmare because it erodes trust. Doctors are not "always right", which is why "always get a second opinion" is codified in culture.
But AI's problem is that it's completely full of shit, sometimes, and the people most qualified to evaluate whether it's full of shit are the doctors, not the patients. Yet, just like in OP's original article, patients are left feeling like their second opinion from AI might be more trustworthy than their doctor's opinion.
They are using the “gold standard for the evaluation of expert medical computing systems”, not a proxy for what a doctor actually does when diagnosing someone.
It may have some utility after diagnosis, but this test doesn’t demonstrate utility for patients.
It will also tell you you're God and/or a toaster. If you're gonna let benchmarks convince you to listen to an LLM on matters of health it's your funeral, just don't get anyone else killed with you please.
Not quite. An LLM generates text that would likely follow. The sky is… “blue”. A patient in pain with a bone protruding from their shin has a… “broken leg”.
The more training data, the more questions it can answer with a reasonable probability of being accurate.
Throwing away a potentially useful analysis just because it’s probabilistic seems a bit like throwing the baby out with the bathwater.
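To make the "text that would likely follow" idea concrete, here is a minimal toy sketch in Python. It is not how a real LLM works internally: it is a simple bigram counter over a made-up four-sentence corpus, and the names `CORPUS`, `build_model`, and `next_word` are invented for illustration. It only shows the general principle of picking the most frequent continuation.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real LLM is trained on vastly more text.
CORPUS = [
    "the sky is blue",
    "the sky is blue",
    "the sky is grey",
    "the bone is broken",
]

def build_model(sentences):
    """Count which word follows each word (a bigram table)."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def next_word(table, prompt):
    """Return (most likely next word, estimated probability).

    Only the last word of the prompt is used, which is the key
    limitation of this toy model.
    """
    last = prompt.split()[-1]
    counts = table.get(last)
    if not counts:
        return None, 0.0
    word, n = counts.most_common(1)[0]
    return word, n / sum(counts.values())

model = build_model(CORPUS)
print(next_word(model, "the sky is"))   # ('blue', 0.5)
print(next_word(model, "the bone is"))  # ('blue', 0.5)
```

Note that the toy model answers "blue" for the bone prompt with exactly the same probability as for the sky prompt, because it has too little context and too little data. More data and more context improve the odds, but the output is still a likely continuation, not a checked fact.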