by motohagiography
614 days ago
> We find strong evidence that accuracy on today's medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.

I interpreted this as challenging whether answering PubMedQA questions as well as a physician is correlated with recommending successful care paths based on the results (and other outcomes) shown in the sample corpus of medical records.

The analogy is a joke I used to make about ML, where it made for crappy self-driving cars but surprisingly good pedestrian and cyclist hunter-killer robots. Really, LLMs aren't expert system reasoners (yet), and if the medical records all contain the same meta-errors that ultimately kill patients, there's a GIGO problem where the failure mode of AI medical opinions is that they make the same errors faster and at greater scale.

LLMs may be really good at finding how internally consistent an ontology made of language is, where the quality of the results is an effect of that internal logical consistency. There's probably a Pareto distribution of cases where AI is amazing for basic stuff like "see a doctor" and then conspicuously terrible in some cases where a human is obviously better.
“This model scores higher on MMLU” or on some other off-the-shelf benchmark may (likely?) have essentially nothing to do with performance on a given use case, especially when it’s highly specialized.
Those benchmarks can give you a general idea of a model’s capabilities, but if you don’t have a benchmark for what you’re actually trying to do, in the end you’re flying blind.
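
For what it's worth, a use-case benchmark doesn't have to be elaborate. A minimal sketch, assuming your task can be scored by exact match against expected answers: `task_accuracy`, `cases`, and `echo_model` are all hypothetical names, the two cases are placeholders rather than real data, and `echo_model` stands in for whatever client you actually call.

```python
from typing import Callable, Iterable, Tuple


def task_accuracy(
    model: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of cases where the model's answer matches
    the expected answer (case-insensitive, surrounding whitespace ignored)."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


# Placeholder cases: replace with examples drawn from your real workload.
cases = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
]


# Stand-in for a real model call (hypothetical; swap in your own client).
def echo_model(prompt: str) -> str:
    return "4"


print(task_accuracy(echo_model, cases))  # prints 0.5
```

Exact match is the crudest possible scorer and won't suit free-text answers, but even a few dozen cases like this, taken from the work you actually care about, tell you more about fit than a leaderboard number does.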