by motohagiography 614 days ago
> We find strong evidence that accuracy on today's medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.

I interpreted this as challenging whether answering PubMedQA questions as well as a physician is correlated with recommending successful care paths based on the results (and other outcomes) shown in the sample corpus of medical records.

The analogy is a joke I used to make about ML where it made for crappy self-driving cars but surprisingly good pedestrian and cyclist hunter-killer robots.

Really, LLMs aren't expert system reasoners (yet) and if the medical records all contain the same meta-errors that ultimately kill patients, there's a GIGO problem where the failure mode of AI medical opinions is making the same errors faster and at greater scale. LLMs may be really good at finding how internally consistent an ontology made of language is, where the quality of its results is the effect of that internal logical consistency.

There's probably a Pareto distribution of cases where AI is amazing for basic stuff like "see a doctor" and then conspicuously terrible in some cases where a human is obviously better.


An often-ignored/forgotten/unknown fact about utilizing LLMs is that you really need to develop your own benchmark for your specific application/use-case. It’s step 1.

“This model scores higher on MMLU” or some other off-the-shelf benchmark may (likely?) have essentially nothing to do with performance on a given specific use-case, especially when it’s highly specialized.

They can give you a general idea of the capabilities of a model but if you don’t have a benchmark for what you’re trying to do in the end you’re flying blind.
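
To make "develop your own benchmark" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two labeled examples, the facility names, and `first_line_baseline`, which stands in for whatever model call you would actually make. The point is only the shape: a handful of labeled cases from your own task plus a scoring function you can rerun on every candidate model.

```python
from typing import Callable

# Hypothetical labeled examples for one narrow task: naming the facility a
# record came from. Real examples would come from your own application data.
EXAMPLES = [
    {
        "record": "Referred by: Lakeside Clinic\nFacility: Northgate Hospital",
        "expected": "Northgate Hospital",
    },
    {
        "record": "Facility: Lakeside Clinic\nCC: Northgate Hospital",
        "expected": "Lakeside Clinic",
    },
]


def run_benchmark(model: Callable[[str], str], examples: list) -> float:
    """Return exact-match accuracy of `model` on the task-specific examples."""
    correct = sum(model(ex["record"]).strip() == ex["expected"] for ex in examples)
    return correct / len(examples)


def first_line_baseline(record: str) -> str:
    """Stand-in for a model call: returns whatever name is on the first line."""
    return record.splitlines()[0].split(": ", 1)[1]


score = run_benchmark(first_line_baseline, EXAMPLES)
print(f"task accuracy: {score:.2f}")
```

Swapping `first_line_baseline` for a function that wraps a real model gives you a number that is about your use-case rather than about MMLU. Exact match is the simplest possible metric; a real benchmark would likely need more examples and a scoring rule suited to the task.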

I think that's very true -- and it felt like one of the real opportunities we had in the paper: that we have real production tasks whose results we need to stand behind, and so we can try to explain and show examples of what matters in that context.

One of the sentences near the end that speaks to this is "...[this shows] a case where the type of medical knowledge reflected in common benchmarks is little help getting basic, fundamental questions about a patient right." Point being that you can train on every textbook under the sun, but if you can't say which hospital a record came from, or which date a visit happened as the patient thinks of it, you're toast -- and those seemingly throwaway questions are way harder to get right than people realize. NER can find the dates in a record no problem, but intuitively mapping out how dates are printed in EHR software and how they reflect the workflow of an institution is the critical step needed to pick the right one as the visit date -- that's a whole new world of knowledge that the LLM needs to know, which is not characterized when just comparing results on medical QA.
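
A toy Python sketch of the gap between finding dates and picking the visit date follows. It is not the method from the paper: the record text, the date format, and the `VISIT_LABELS` list are all made up for illustration, and real EHR printouts are far messier than three labeled lines.

```python
import re

# Hypothetical printout with three dates, only one of which is the visit.
RECORD = """Printed: 03/14/2023
Date of service: 03/02/2023
Signed: 03/05/2023"""

DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

# Assumed labels that mark the visit date. In practice this knowledge varies
# by EHR software and by institution, which is the hard part.
VISIT_LABELS = ("date of service", "visit date")


def find_dates(text: str) -> list:
    """Finding every date is easy: a regex (or NER) returns all of them."""
    return DATE_RE.findall(text)


def pick_visit_date(text: str):
    """Choosing the visit date needs knowledge of how the record is laid out."""
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip().lower() in VISIT_LABELS:
            match = DATE_RE.search(rest)
            if match:
                return match.group()
    return None


all_dates = find_dates(RECORD)
visit = pick_visit_date(RECORD)
print(all_dates, visit)
```

Here `find_dates` returns all three dates, while `pick_visit_date` only works because the label list happens to match this made-up layout. Change the software or the institution and the list is wrong, which is the kind of knowledge a medical QA score does not measure.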

Giving examples of the crazy things we have to contend with is something I can (and will!) gladly talk about for hours...

One other interesting comment in there -- the note about how people think the worst records to deal with are the old handwritten notes. But actually, content-wise they tend to be very to-the-point. Clean printouts from EHR software have so much extra junk and redundancy that you end up with much lower SNR. Even just structuring a single EHR record can require you to look across many pages and do tons of filtering that doesn't come into play on the old handwritten notes (once you get past OCR).

Long way of saying: I feel for today's clinicians. EHRs were supposed to solve all problems, but they've also made things harder in a lot of ways.

Have you seen/heard of Abridge[0]? Long story short their secret sauce comes in two main forms:

1. Accurate speech rec, diarization, etc. to record a clinician-patient encounter. No notes, no scribes, no "physician staring at Epic when they should be looking at and talking to you".

2. Parsing of transcripts to correctly and accurately populate the patient EHR record - including various structured fields, etc.

Needless to say you're in this space so I don't have to tell you - every Epic/Cerner install is basically a snowflake so there's a lot going on here, especially at scale.

[0] - https://www.abridge.com/