by bsaul 119 days ago
Not my experience either, and I'm on Claude Code. I'd be really curious to see what went wrong in OP's case. Maybe too many instructions? Could it be that it used a fast model instead of the deep ones?

No, OP said he used the Max Opus 4.6.

Anyways, I think one area where Codex and Claude Code fall short is that they do not test the changes they make by actually using the app.

In this case, the LLM should ideally render the page in a real browser and actually click on the buttons to verify. Best if the LLM tests it before the changes and then again after, to confirm the behavior is the same. Maybe it should take a screenshot before the change, take another after, and compare the two.
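A rough sketch of what that before/after comparison could look like, purely as an illustration. The names `changed_fraction` and `screenshots_match` are made up here, and `capture` / `apply_change` are hypothetical callables you would supply yourself. It assumes the captures are raw pixel buffers (decoded RGB bytes, say), since comparing compressed PNG bytes position by position would not mean much.

```python
from typing import Callable


def changed_fraction(before: bytes, after: bytes) -> float:
    """Fraction of byte positions that differ between two raw pixel buffers."""
    if len(before) != len(after):
        return 1.0  # different dimensions: treat as completely changed
    if not before:
        return 0.0  # two empty captures are trivially identical
    differing = sum(1 for a, b in zip(before, after) if a != b)
    return differing / len(before)


def screenshots_match(
    capture: Callable[[], bytes],
    apply_change: Callable[[], None],
    max_changed: float = 0.0,
) -> bool:
    """Capture, apply the change, capture again, and compare.

    `capture` and `apply_change` are hypothetical hooks: one returns the
    current rendering as raw bytes, the other performs the code change.
    """
    before = capture()
    apply_change()
    after = capture()
    return changed_fraction(before, after) <= max_changed
```

A small nonzero `max_changed` is probably needed in practice, because things like font antialiasing or timestamps can shift a few pixels even when nothing is really broken.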

I asked why Codex and Claude don't do this here: https://news.ycombinator.com/item?id=46792066

Yeah, if you have these tools in place to validate its changes, you can quickly iterate with it to the right results. But think through how it's making UI changes and it quickly becomes obvious why it can make absolutely wrong and terrible guesses about the implementation details: it can't _see_ what it's doing or interact with it. It's just pattern matching against other implementations it has seen.
Yea, the next breakthrough for Codex or Claude Code would be to actually use/test the app like a real human would during the development process.
Here's a document produced by Claude Code using my Showboat testing tool this morning to help explore SeaweedFS (a local S3 clone) - it includes trying things out with curl and getting screenshots from Chrome using my Rodney tool: https://github.com/simonw/research/blob/main/seaweedfs-testi...
You can easily do this, at least with Claude Code. Ask it to install and use Playwright to confirm rendering and flow. You're right that not doing this by default is a failing. When it does, it definitely helps cut down on bugs.

EDIT: Sorry, just noticed you said "real browser". Haven't tried this but Playwright gets you a long way down the road.
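For anyone curious what the Playwright side might look like, here is a minimal sketch using Playwright's Python sync API. It assumes you have run `pip install playwright` and `playwright install chromium`. The function name, URL, and selector are placeholders for illustration, not anything Claude Code generates on its own.

```python
from typing import Tuple


def capture_before_and_after_click(url: str, selector: str) -> Tuple[bytes, bytes]:
    """Load a page, screenshot it, click one element, and screenshot again.

    Returns the two screenshots as PNG bytes. `url` and `selector` are
    whatever your app needs; the values in the usage comment are hypothetical.
    """
    # Imported lazily so this file still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless Chromium by default
        try:
            page = browser.new_page()
            page.goto(url)
            before = page.screenshot()  # PNG bytes of the initial render
            page.click(selector)        # waits for the element, then clicks it
            after = page.screenshot()   # PNG bytes after the interaction
        finally:
            browser.close()
    return before, after


# Hypothetical usage against a local dev server:
# before, after = capture_before_and_after_click("http://localhost:3000", "text=Save")
```

Handing the two images back to the model (or diffing them) gives it at least some way to check that a click actually did something.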

Will check it out. Looks like there is also chrome-devtools-mcp for Codex.
FWIW, I've found Playwright tests to be a decent way of getting Claude to do what you're talking about.
See the /chrome command in Claude Code.
They say explicitly what model they're using.