| Feels more like a mixed bag than a regression? E.g., GPT-5 beats GPT-4 on factual recall and reasoning (HeadQA, Medbullets, MedCalc), but slips on structured queries (EHRSQL), fairness (RaceBias), and evidence QA (PubMedQA). Hallucination resistance is better, but only modestly. Latency seems uneven (maybe needs more testing?): faster on long tasks, slower on short ones. |