| HN Mirror



	by simianwords 85 days ago
	There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counterexample? Edit: here’s what I tried: https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...

2 comments

stratos123 85 days ago

Did you use the exact API call shown in the paper? I am unable to replicate the paper's counterexamples via the chat UI, but that's not very surprising (if the LLM already only fails a few cases out of thousands, the small differences in context between API and chat might fix them).


simianwords 85 days ago

I tried this https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...


pton_xd 85 days ago

"in this paper we primarily evaluate the LLM itself without external tool calls."

Maybe this is a factor?


simianwords 85 days ago

No tools were used.


chromacity 85 days ago

IIRC, web chat often uses tools / code without surfacing this information in any obvious way.
