Hacker News new | ask | show | jobs
by deepikaa_s 63 days ago
Built a clinical safety eval harness covering three failure categories: numerical impossibilities, wrong-premise clinical claims, and unverifiable medication information. Tested GPT-4o, GPT-4.1, GPT-5, GPT-5-mini, Claude Opus, Sonnet, Haiku, Gemini 2.5 Pro and Flash across 25 cases. The hardest cases require pre-emption stopping before answering when the premise is unverifiable. Most models fail this even when they pass standard safety evals.

Code: https://github.com/deepikaa-s/clinical-safety-eval