Modern agents can write tests that are meaningful and require the agent to pass them with any change to avoid regressions. Humans can review the code/test the downstream application to ensure it works as intended.
It rarely takes a single prompt to get something done, but the agents can figure out as long as the human is specific about what constitutes accurate.
It's counterintuitive yes, but you can. You can just look at the tests to ensure they're consistent and with the latest models that has always been the case, and it's very rare that the models try to cheat the tests.
you can convince yourself it can, by all means. But it doesn't make it true. In fact we even have this rule for apecoding: a developer cannot review their own code.
It rarely takes a single prompt to get something done, but the agents can figure out as long as the human is specific about what constitutes accurate.