Hacker News
by pton_xd 85 days ago
"in this paper we primarily evaluate the LLM itself without external tool calls."

Maybe this is a factor?

1 comment

No tools were used.
IIRC, web chat interfaces often use tools or run code without surfacing this in any obvious way.