Hacker News new | ask | show | jobs
by pastescreenshot 90 days ago
The interesting question to me is not whether the system can generate a plausible PR-time test, but whether the useful ones survive after the PR is gone. If Canary catches a real regression, how often can that check be promoted into a stable long-lived regression test without turning into a flaky, environment-coupled browser script? That conversion rate feels closer to the real moat than the generation demo.
1 comments

Good point. To keep the regression tests reliable as the app evolves, we run a reliability cascade. First, we generate and execute deterministic Playwright from the codebase. If execution fails then we fall back to DOM and aria tree. If that still fails, we fall back to vision agents that verify what the user actually sees before flagging a drift in the application behavior