by simianwords 85 days ago
There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counterexample?

Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...

2 comments

Did you use the exact API call shown in the paper? I can't replicate the paper's counterexamples via the chat UI, but that's not very surprising: if the LLM fails only a few cases out of thousands, the small differences in context between the API and the chat UI might be enough to fix them.
"in this paper we primarily evaluate the LLM itself without external tool calls."

Maybe this is a factor?
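
To put rough numbers on the "few cases out of thousands" point, here is a minimal sketch of how likely a handful of spot checks is to hit a failure at all. The 3-in-1000 rate is a made-up illustration, not a figure from the paper, and it assumes each check is independent.

```python
def p_at_least_one_failure(failure_rate: float, trials: int) -> float:
    """Chance of seeing at least one failure in `trials` independent checks."""
    return 1.0 - (1.0 - failure_rate) ** trials


# Hypothetical failure rate: 3 failures per 1000 cases (an assumption,
# chosen only to match "a few cases out of thousands").
rate = 3 / 1000

for n in (10, 100, 1000):
    print(f"{n} checks: {p_at_least_one_failure(rate, n):.3f}")
```

Under those assumptions, ten manual checks would hit a failure only a few percent of the time, so not finding one in the chat UI says little either way.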

No tools were used.
IIRC, web chat often uses tools or runs code without surfacing this in any obvious way.