Hacker News
by throwaway041207 43 days ago
> Just as "use code for contracts" failed for cryptocurrencies, "use AI output as prod" will fail for AI. Both are based on "just don't make catastrophic mistakes anymore".

What I think will happen is that AI will write code and do the best it can to mitigate mistakes prior to rollout, but once rollout time comes, rollout will be incremental and it will self-monitor by defining success conditions at rollout time. The nature of the code will limit any "catastrophe" to a small group at worst, but most likely the initial rollout will just run new versions of the code in a simulated context (language design could benefit from this) and analyze potential outcomes without affecting current functionality.

But when the code goes live... it will slowly and progressively widen the scope of the change (think feature/experiment flags), and if it fails in the initial cohort, it will redirect. If the success conditions are met, it will increase the rollout cohort.

This is a normal software engineering practice today, but it's labor- and process-intensive when driven by humans. In a world where humans are less involved, this process becomes scalable.
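
To make the loop above concrete, here is a minimal sketch of a staged rollout. Everything in it is an assumption for illustration: the names `progressive_rollout` and `RolloutResult`, the stage percentages, and the idea that `check_success` is a callback evaluating the success conditions for the current cohort. A real system would sit on top of a feature-flag service rather than a plain loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    final_percent: int   # share of traffic on the new version when the loop stopped
    rolled_back: bool    # True if a stage failed and traffic was redirected
    history: List[int]   # cohort sizes that were actually tried


def progressive_rollout(
    check_success: Callable[[int], bool],
    stages: Sequence[int] = (1, 5, 25, 50, 100),  # hypothetical cohort sizes in percent
) -> RolloutResult:
    """Widen the cohort stage by stage; redirect to the old version on the first failure."""
    history: List[int] = []
    for percent in stages:
        history.append(percent)
        # check_success stands in for "did the success conditions hold for this cohort?"
        if not check_success(percent):
            return RolloutResult(final_percent=0, rolled_back=True, history=history)
    final = history[-1] if history else 0
    return RolloutResult(final_percent=final, rolled_back=False, history=history)
```

Calling `progressive_rollout(lambda pct: pct < 25)` walks the 1% and 5% cohorts, fails at 25%, and reports a rollback, so only the cohorts tried so far ever saw the new code.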


This assumes failures can be detected and fixed more easily than generating the corresponding change. I am not convinced that's the case.

Counterpoints to my own arguments:

1. We don't know yet in detail what AI is good at.

2. AI doesn't need to be perfect, just "good enough", whatever that means for a specific project. More failures while saving hundreds of thousands of dollars each year might be acceptable, for example.

> 2. AI doesn't need to be perfect, just "good enough", whatever that means for a specific project. More failures while saving hundreds of thousands of dollars each year might be acceptable, for example.

This, I think, is the unexplored aspect of what's happening right now. Guardrails around "good enough" systems are where the future value lies. In the future, code will never be as good as when the artisans were writing it, but if you have an automated process to validate/verify mediocre code (and kick it back to AI for refinement when it fails) before it's fully productionized, then you have a pathway to scaling agentic coding.
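
A rough sketch of that validate-and-kick-back loop, under stated assumptions: `generate` is a stand-in for whatever produces a candidate change (it receives the previous failure feedback, or `None` on the first try), `validate` is a stand-in for the guardrail and returns a pass/fail flag plus feedback, and the `max_attempts` cap is my own addition so the loop cannot spin forever.

```python
from typing import Callable, Optional, Tuple


def refine_until_valid(
    generate: Callable[[Optional[str]], str],      # hypothetical: feedback -> candidate change
    validate: Callable[[str], Tuple[bool, str]],   # hypothetical: candidate -> (ok, feedback)
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes validation, or None if every attempt fails."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        ok, feedback = validate(candidate)
        if ok:
            return candidate  # working code is accepted, mediocre or not
        # otherwise the failure feedback is handed back to the generator on the next pass
    return None  # nothing passed; escalate to a human instead of shipping
```

The important design choice is that `validate` lives outside the generator: the thing writing the code does not get to decide whether the code passed.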

Validating/verifying mediocre code is pretty hard, as nobody has been able to agree on what that even means.

If you are working with AI to define the purpose and goal of the change (which is to say, planning how the changes to the code should result in some sort of feature/bugfix/whatever), then the planning phase should ask you to define clear success conditions for the code that it writes. These could be otel/datadog metrics, some kind of funnel metric, some cessation of errors in your APM, whatever. In any case, the outcome of the change is what I mean by validate/verify. Mediocre code can solve issues, and we can tolerate mediocre code in that sense. The guardrails kick back failing "mediocre" code and accept working mediocre code.

And this could easily apply to every change we made by hand before AI; it was just a tedious process to layer these things into code when we were just fixing bugs and whatnot. In an AI-writes-all-the-code world, adding this kind of stuff as table stakes for a changeset is zero cost, effort-wise.
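
As an illustration of what success conditions attached to a changeset might look like, here is a small sketch. The `SuccessCondition` type, the metric names, and the thresholds are all made up for the example; in practice the observed values would come from your metrics backend rather than a dict.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SuccessCondition:
    metric: str        # hypothetical metric name, e.g. "error_rate"
    comparison: str    # "<=" or ">="
    threshold: float


_OPS = {"<=": operator.le, ">=": operator.ge}


def failed_conditions(
    conditions: Iterable[SuccessCondition],
    observed: Dict[str, float],
) -> List[str]:
    """Return the names of metrics whose condition does not hold.

    A metric with no observed value counts as a failure: no data is not success.
    """
    failures: List[str] = []
    for cond in conditions:
        value = observed.get(cond.metric)
        if value is None or not _OPS[cond.comparison](value, cond.threshold):
            failures.append(cond.metric)
    return failures
```

An empty list from `failed_conditions` means the change did what it was planned to do; anything else is the signal to kick the change back.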

Functional requirements can be handled easily that way, yes. Maintainability, however, is about non-functional requirements like low complexity and decoupling.

To me, the trend seems to be that AI produces the same challenges that humans did before, and that the same solutions are helping. Without a good, maintainable code base, AI will eventually fail to fulfill even the quantifiable requirements of changes.

That's kind of been the point of software since the beginning. Nobody cares about the easy stuff that can be produced without much effort, and what's possible without much effort has changed dramatically over the years.

> rollout will be incremental and it will self-monitor by defining success conditions at rollout time.

This sounds a lot like allowing an LLM to define the tests as well as the implementation, and allowing the LLM to update the tests to make the code pass. Recently people have come to understand (again?) that testing and evaluation work better outside of the sandbox.

Sorry, I wasn't very clear about that part. I think success conditions are described by stakeholders, whoever they are, and then the implementation of monitoring them is probably created by the LLM. For engineering-level stakeholders that's going to be metrics, performance, etc., whereas for more business-side stakeholders it'll be a mix of data metrics and product feature metrics, click-through rates, stuff like that.