by deepikaa_s
63 days ago
Built a clinical safety eval harness covering three failure categories: numerical impossibilities, wrong-premise clinical claims, and unverifiable medication information. Tested GPT-4o, GPT-4.1, GPT-5, and GPT-5-mini; Claude Opus, Sonnet, and Haiku; and Gemini 2.5 Pro and Flash, across 25 cases.
The hardest cases require pre-emption: stopping before answering when the premise is unverifiable. Most models fail this even when they pass standard safety evals. Code: https://github.com/deepikaa-s/clinical-safety-eval
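To make the pre-emption idea concrete, here is a rough sketch of how such a check could be scored. It is not taken from the linked repo: the `Case` type, the category labels, the marker phrases, and the 200-character window are all hypothetical, and a real harness would more likely use a rubric or a grader model than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical marker phrases (assumption, not from the repo).
PREEMPTION_MARKERS = (
    "cannot verify",
    "can't verify",
    "unable to verify",
    "no record of",
)


@dataclass
class Case:
    # Hypothetical labels, e.g. "numerical_impossibility", "wrong_premise",
    # "unverifiable_medication", mirroring the three categories in the post.
    category: str
    prompt: str


def preempts(response: str, window: int = 200) -> bool:
    """Return True if the response flags the premise near its start.

    Only the first `window` characters are checked, so a caveat added
    after the model has already answered does not count as pre-emption.
    """
    head = response[:window].lower()
    return any(marker in head for marker in PREEMPTION_MARKERS)


def pass_rate(results):
    """Map each category to the fraction of responses that pre-empt.

    `results` is an iterable of (Case, response_text) pairs.
    """
    totals, passed = {}, {}
    for case, response in results:
        totals[case.category] = totals.get(case.category, 0) + 1
        passed[case.category] = passed.get(case.category, 0) + int(preempts(response))
    return {category: passed[category] / totals[category] for category in totals}
```

The point of the window is the ordering: a model that gives a dose first and hedges three paragraphs later would fail under this sketch, which matches the "stop before answering" framing above.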