They are using the “gold standard for the evaluation of expert medical computing systems”, not a proxy for what a doctor actually does when diagnosing someone.
It may have some utility after diagnosis, but this test doesn’t demonstrate utility for patients.
It will also tell you you're God and/or a toaster. If you're gonna let benchmarks convince you to listen to an LLM on matters of health, it's your funeral. Just don't get anyone else killed with you, please.
Not quite. An LLM generates text that would likely follow. The sky is… “blue”. A patient in pain with a bone protruding from their shin has a… “broken leg”.
The more training data, the more questions it can answer with a reasonable probability of being accurate.
Throwing away a potentially useful analysis just because it’s probabilistic seems a bit like throwing the baby out with the bath water.
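To make the "text that would likely follow" idea concrete, here's a rough toy sketch. It just counts which word follows a two-word context in a tiny made-up corpus and turns the counts into probabilities. To be clear about the assumptions: the corpus and the names `train` and `next_word_probs` are invented for illustration, and a real LLM uses a neural network over tokens rather than a lookup table of counts. The only point is that "pick the likely continuation" is a probability estimate learned from data.

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "the sky is blue",
    "the sky is blue",
    "the sky is grey",
    "the grass is green",
]

def train(sentences):
    """Count which word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - 2):
            context = (words[i], words[i + 1])
            counts[context][words[i + 2]] += 1
    return counts

def next_word_probs(counts, context):
    """Turn raw follow-up counts into probabilities (empty if unseen)."""
    followers = counts[context]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

model = train(corpus)
probs = next_word_probs(model, ("sky", "is"))
best = max(probs, key=probs.get)  # most likely continuation
print(best, probs)
```

With this corpus, "blue" comes out as the most likely word after "sky is" because it was seen twice versus once for "grey". A context the model never saw gets no prediction at all, which is the toy version of why more training data lets it answer more questions.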
Studies have found that newer reasoning AIs are about as good at diagnosing illness from a written description of symptoms as doctors are.
Granted, it cannot actually examine a patient, so we're not replacing doctors anytime soon. But your view is obsolete.
https://www.science.org/doi/10.1126/science.adz4433