| HN Mirror


	by bsaul 119 days ago
	Not my experience either, and I'm on Claude Code. I'd be really curious to see what went wrong in OP's case. Maybe too much instruction? Could it be that it used a fast model instead of the deep ones?

2 comments

aurareturn 119 days ago

No, OP said he used the Max Opus 4.6.

Anyways, I think one area where Codex and Claude Code fall short is that they do not test the changes they make by using the app.

In this case, the LLM should ideally render the page in a real browser and actually click on the buttons to verify. Best if the LLM tests it before the changes and then again after, to confirm the behavior is the same. Maybe it should take a screenshot before the change, take another after, and check that they match.

I asked why Codex and Claude don't do this here: https://news.ycombinator.com/item?id=46792066
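The before/after screenshot idea above could be sketched roughly like this in Python with Playwright. This is only an illustration, not something Codex or Claude Code does out of the box: the helper names `capture` and `screenshots_match` are made up for the example, and a byte-for-byte comparison is the strictest possible check (fonts, animations, or timestamps on the page would make it fail, so a real setup would likely want a tolerance).

```python
def screenshots_match(before: bytes, after: bytes) -> bool:
    """Return True only when the two PNG byte strings are identical."""
    return before == after


def capture(url: str) -> bytes:
    """Render `url` in headless Chromium and return a full-page PNG as bytes.

    Hypothetical helper: assumes Playwright and its Chromium build are installed.
    """
    # Imported lazily so the comparison helper works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        png = page.screenshot(full_page=True)
        browser.close()
    return png
```

The agent would call `capture(url)` once before editing and once after, then treat `screenshots_match(before, after)` returning False as a signal to look at what changed.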

threetonesun 119 days ago

Yeah, if you have these tools in place to validate its changes, you can quickly iterate with it to the right results. But think through how it's making UI changes and it quickly becomes obvious why it can make absolutely wrong and terrible guesses about the implementation details: it can't _see_ what it's doing or interact with it. It's just pattern matching other implementations it's seen.

aurareturn 119 days ago

Yea, the next breakthrough for Codex or Claude Code would be to actually use/test the app like a real human would during the development process.

simonw 119 days ago

Here's a document produced by Claude Code using my Showboat testing tool this morning to help explore SeaweedFS (a local S3 clone) - it includes trying things out with curl and getting screenshots from Chrome using my Rodney tool: https://github.com/simonw/research/blob/main/seaweedfs-testi...

mwigdahl 119 days ago

You can easily do this, at least with Claude Code. Ask it to install and use Playwright to confirm rendering and flow. You're correct that it is a failing not to do this by default. When you do, it definitely helps cut down on bugs.

EDIT: Sorry, just noticed you said "real browser". Haven't tried this but Playwright gets you a long way down the road.
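For a sense of what "confirm rendering and flow" might look like once Playwright is installed, here is a minimal sketch. The `run_flow` and `check_page` helpers and the `(action, selector, value)` step format are invented for this example; only `click`, `fill`, and `is_visible` are actual methods on Playwright's sync `Page` object.

```python
def run_flow(page, steps):
    """Apply (action, selector, value) steps to a Playwright-style page.

    Returns the selectors whose visibility checks failed (empty list = all good).
    """
    failures = []
    for action, selector, value in steps:
        if action == "click":
            page.click(selector)
        elif action == "fill":
            page.fill(selector, value)
        elif action == "expect_visible":
            if not page.is_visible(selector):
                failures.append(selector)
        else:
            raise ValueError(f"unknown action: {action}")
    return failures


def check_page(url, steps):
    """Open `url` in headless Chromium and run the steps against it.

    Hypothetical wrapper: assumes Playwright and its Chromium build are installed.
    """
    # Imported lazily so run_flow can be exercised without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        failures = run_flow(page, steps)
        browser.close()
    return failures
```

Something like `check_page(url, [("click", "#save", None), ("expect_visible", "#saved-banner", None)])` would then return an empty list when the flow works. The selectors here are placeholders.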

aurareturn 119 days ago

Will check it out. Looks like there is also chrome-devtools-mcp for Codex.

lenerdenator 119 days ago

FWIW, I've found Playwright tests to be a decent way of getting Claude to do what you're talking about.

throwup238 119 days ago

See the /chrome command in Claude Code.

n4r9 119 days ago

They say explicitly what model they're using.
