by shipilovya
392 days ago
Last week OpenAI released HealthBench, the most comprehensive set of health evals to date. The top three scoring models each excelled at something different:

- GPT-4.1 is best when you need a straight answer
- o3 is best for complex cases
- Grok is best at clarifying important info (“truthseeking”)

I made this prototype mostly to understand HealthBench more deeply. I will probably use it in future products I build.