| This is long overdue for biomedicine. Even Google DeepMind's relabeled MedQA dataset, created for Med-Gemini in 2024, has flaws. Many healthcare datasets and benchmarks contain dirty data because incentives for accuracy are absent and few annotators are qualified. We had to pay Stanford MDs to annotate 900 new questions for evaluating frontier models, and we will release them as open source on Hugging Face for anyone to use. They cover VQA and specialties such as neurology, pediatrics, and psychiatry. If labs want early access, please reach out (info in profile). We are still finalizing the dataset format. In general-purpose LLMs, noise is tolerable and sometimes even desirable. In biomedical models, training on incorrect or outdated information may lead to clinical errors, misfolded proteins, or drugs with off-target effects. Complicating matters, medical facts shift over time, which can invalidate both training data and model knowledge: what was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force revised its longstanding advice and now recommends biennial mammograms starting at age 40, down from the previous benchmark of 50, for average-risk women, citing rising breast-cancer incidence in younger patients. |