Hacker News
by epolanski 4 days ago
The problem is that this is very hard to replicate, and benchmarks focus on E2E tests that go from one prompt to the final solution.

They do not test how models perform when used interactively, which is how most of us use them.