by kkielhofner
607 days ago
An often ignored, forgotten, or simply unknown fact about using LLMs is that you really need to develop your own benchmark for your specific application/use-case. It’s step 1. “This model scores higher on MMLU” or some other off-the-shelf benchmark may (likely?) have essentially nothing to do with performance on a given use-case, especially a highly specialized one. Off-the-shelf benchmarks can give you a general idea of a model's capabilities, but if you don’t have a benchmark for what you’re actually trying to do, in the end you’re flying blind.
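To make "develop your own benchmark" a bit more concrete, here is a minimal sketch of what step 1 could look like in Python. Everything in it is made up for illustration: the `Case` type, the two example prompts, the exact-match scoring, and `stub_model`, which only stands in for whatever call you would make to a real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str    # a question drawn from YOUR use-case
    expected: str  # the answer you consider correct


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't count as errors.
    return " ".join(text.lower().split())


def run_benchmark(model: Callable[[str], str], cases: list[Case]) -> float:
    # Fraction of cases where the model's answer matches the expected answer exactly.
    if not cases:
        raise ValueError("benchmark needs at least one case")
    correct = sum(normalize(model(c.prompt)) == normalize(c.expected) for c in cases)
    return correct / len(cases)


# Hypothetical cases; a real benchmark would use examples from your own data.
CASES = [
    Case("Record: 'Visit 2021-03-04, printed 2021-03-09'. What is the visit date?", "2021-03-04"),
    Case("Record: 'St. Example Hospital discharge summary'. Which hospital issued it?", "St. Example Hospital"),
]


def stub_model(prompt: str) -> str:
    # Placeholder for a real LLM call (assumption: your model maps a prompt to a string).
    return "2021-03-04" if "date" in prompt else "unknown"


score = run_benchmark(stub_model, CASES)
print(f"accuracy: {score:.2f}")
```

Exact match is the crudest possible scorer and probably not what you want for free-text answers, but even something this small tells you more about your use-case than an MMLU number does.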
|
One of the sentences near the end that speaks to this is "...[this shows] a case where the type of medical knowledge reflected in common benchmarks is little help getting basic, fundamental questions about a patient right." The point being that you can train on every textbook under the sun, but if you can't say which hospital a record came from, or on which date a visit happened as the patient thinks of it, you're toast -- and those seemingly throwaway questions are way harder to get right than people realize. NER can find the dates in a record no problem, but picking the right one as the visit date requires understanding how EHR software prints dates and how those dates reflect an institution's workflow. That's a whole other world of knowledge the LLM needs, and it isn't measured when you only compare results on medical QA.
Examples of the crazy things we have to contend with are something I can (and will!) gladly talk about for hours...
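As a toy illustration of the date problem, here is a rough Python sketch. The record text, its labels, and the "correct" visit date are all invented for the example and are not taken from any real EHR system. It shows that finding dates is the easy part, while a naive rule for choosing among them gets the visit date wrong.

```python
import re
from datetime import date

# Invented record: four dates, only one of which is the visit as the patient thinks of it.
RECORD = """\
Printed: 2023-05-12
Admitted: 2023-05-02
Seen by Dr. Example on 2023-05-03
Follow-up scheduled: 2023-06-01
"""

# Assumed gold label for this made-up record.
VISIT_DATE = date(2023, 5, 3)

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def find_dates(text: str) -> list[date]:
    # The "NER" step: pull out every ISO-formatted date in the text.
    return [date(int(y), int(m), int(d)) for y, m, d in DATE_RE.findall(text)]


def naive_pick(text: str) -> date:
    # Naive heuristic: assume the first date in the record is the visit date.
    return find_dates(text)[0]


candidates = find_dates(RECORD)
print(len(candidates), "candidate dates found")
print("naive pick:", naive_pick(RECORD), "correct:", naive_pick(RECORD) == VISIT_DATE)
```

Here the first date is the print date, so the naive pick is wrong. Getting it right means knowing what each label means in that institution's workflow, which is exactly the kind of knowledge a generic medical QA score says nothing about.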