by TSiege 3 hours ago
Always worth a share for this scenario. It's not clear if LLMs are capable of doing actual analysis on medical imaging. For details see this article https://futurism.com/artificial-intelligence/frontier-models...

> As detailed in a new, yet-to-be-peer-reviewed paper, a team of researchers at Stanford University found that frontier AI models readily generated “detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided.”

> In other words, the AI models happily came up with answers to questions about a supposedly accompanying image — even if the researchers never even showed it an image.

> As opposed to hallucinations, which involve AI models arbitrarily filling in the gaps within a logical framework, the team coined a new term for the phenomenon: “mirage reasoning.”

> The effect “involves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user and basing the rest of the conversation on that, therefore changing the context of the task at hand,” the researchers wrote in their paper.

> The damning findings suggest AI models cheat by diving into the data they were given — and coming up with the rest based on probability, even if it’s almost entirely conjecture.
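The no-image setup described in the quote can be sketched as a simple control run. This is only an illustration, not the researchers' actual method: the `ask` callable, the phrase list, and the function names are all hypothetical stand-ins for whatever model client and answer classifier you actually have.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: takes a question and optional image bytes,
# returns the model's answer as text. Not a real library API.
AskFn = Callable[[str, Optional[bytes]], str]

# Illustrative phrases only; a real check would need a better classifier.
IMAGE_CLAIM_PHRASES = (
    "the image shows",
    "in the image",
    "the scan shows",
    "as seen in",
)


def describes_missing_image(answer: str) -> bool:
    """True if the answer talks as though an image had been provided."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in IMAGE_CLAIM_PHRASES)


def mirage_rate(ask: AskFn, questions: Sequence[str]) -> float:
    """Fraction of image questions answered as if an image existed,
    even though every question is sent with no image attached."""
    if not questions:
        raise ValueError("need at least one question")
    hits = sum(describes_missing_image(ask(q, None)) for q in questions)
    return hits / len(questions)
```

A rate near zero would mean the model usually notices the missing attachment; a high rate would match the behavior the article describes.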

5 comments

I work at a telemedicine company. We’ve benchmarked a few frontier LLMs on public medical imaging datasets. One test included high-quality and high-consensus otoscopic images. We didn’t expect the models to do well on something so niche, but what concerned us was how poorly calibrated the models were.

I know you can’t trust an LLM’s self-assessed “confidence” in a prediction, but I’ve found that confidence can at least be directionally correct for some tasks. For our benchmarks, however, confidence was poorly correlated with accuracy. What’s worse is that binary classification models (“Do you see $diagnosis in this photo?”) strongly influenced the LLM to confidently predict $diagnosis.

I’m concerned for those using LLMs for diagnostics and getting confidently led to the wrong conclusion.
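For anyone wanting to put a number on "poorly calibrated", one common measure is expected calibration error (ECE). The sketch below is a minimal version, assuming you have logged a stated confidence in [0, 1] and a right/wrong flag for each prediction; the function name and the choice of 10 equal-width bins are my own assumptions, not the benchmark described above.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Bucket predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that says 0.9 on every case but is right only a quarter of the time lands in a single bin with a gap of about 0.65, which is the kind of overconfidence described above.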

But the binary classification models can be made ternary easily. RL on congruence plus penalty for misdiagnosis is easy to set up and gives great results.

What I’ve seen be the true bottleneck is people not setting up the structured data. But making a tiny reasoning model with OPSD -> GRPO is totally doable with a bit of money.
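A rough sketch of what a ternary reward could look like, reading "congruence" as agreement with the consensus label (that reading is my assumption). The `Answer` enum, the `reward` function, and every numeric value are hypothetical; they are not taken from any particular RL library or from a tuned setup.

```python
from enum import Enum


class Answer(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    UNCERTAIN = "uncertain"  # the third option that makes the task ternary


def reward(
    predicted: Answer,
    consensus: Answer,
    agree_bonus: float = 1.0,
    abstain_reward: float = 0.0,
    misdiagnosis_penalty: float = -2.0,
) -> float:
    """Score one prediction against the consensus label.

    Agreement earns the bonus, abstaining is neutral, and any other
    disagreement counts as a misdiagnosis. All values are placeholders.
    """
    if predicted == consensus:
        return agree_bonus
    if predicted is Answer.UNCERTAIN:
        return abstain_reward
    return misdiagnosis_penalty
```

Making the penalty larger in magnitude than the bonus is one way to make abstaining preferable to a coin-flip guess; the right ratio would have to be found empirically.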

It makes a lot of sense if you understand how these models work, but this was a cool read anyway, and studies like this are important for curbing the unfortunate fever dream some folks seem to be collectively having about LLM omnipotence.

I don’t understand how this result differs from giving any LLM a task that is not completely grounded. I’ve observed this in coding tasks: if I forget to include a file referred to in the spec, the LLM will just hallucinate a version of it and my results suck. If I give it the file (and really, all the information I claimed it had access to), the task works fine. I fixed this in my pipeline with a prompt that does an extensive grounding analysis to determine whether the assets I’m giving it are complete with respect to the spec (and that the spec is grounded as well, i.e., it doesn’t refer to something that is undefined).

I wonder if the above problem can be fixed similarly? Just ask the LLM to do a conservative grounding analysis before jumping to the main task?
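Part of a grounding check like this can be done before the model is called at all. The sketch below is a deterministic stand-in, not the prompt-based analysis described above: it only looks for file-like names in the spec and reports the ones that were not supplied. The regex, the extension list, and the function name are assumptions made for illustration.

```python
import re
from typing import Iterable, List

# Hypothetical pattern: treats tokens such as "utils.py" or
# "src/config.yaml" as file references. Extend the extensions as needed.
FILE_REF = re.compile(r"\b[\w./-]+\.(?:py|js|ts|json|yaml|yml|md|txt|csv)\b")


def missing_assets(spec: str, provided: Iterable[str]) -> List[str]:
    """Return files the spec mentions that are not among the provided assets."""
    referenced = set(FILE_REF.findall(spec))
    return sorted(referenced - set(provided))


def check_grounded(spec: str, provided: Iterable[str]) -> None:
    """Refuse to continue when the spec refers to something we did not supply."""
    missing = missing_assets(spec, provided)
    if missing:
        raise ValueError("spec refers to missing assets: " + ", ".join(missing))
```

Calling `check_grounded(spec, assets)` before the main task turns a silent hallucination into an explicit error. It will not catch undefined concepts in the spec, only missing files, so it complements a grounding prompt rather than replacing it.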

It's not different. There's a line of research and reasoning where people who don't use LLMs regularly point out issues that have been known (and more or less solved) for more than a year now (which is an eternity in the LLM space).

The only thing that matters is the success rate when they are actually provided an image.

But why should I care? If you demonstrated that a model can perform more accurate diagnoses than a doctor, but also it had this strange behavior when no image was presented, why should that deter me from using the model?

Because you don’t have any way of telling if it actually used the image presented, or based its conclusions on a different image it made up.

Really? You know you could just ask it.