by epolanski 4 hours ago
I think they simply optimize for E2E benchmarks. None of those benchmarks is designed around multi-turn assistance to the user; they all go from a prompt straight to the final solution.