
by fishtoaster 144 days ago

Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO.

We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:

- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it

- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

- Get better AI-based review: Greptile, Bugbot, and half a dozen others

- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.

None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.

But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.

13 comments

sarchertech 144 days ago

Translating from a natural language spec to code involves a truly massive amount of decision making.

For a non trivial program, 2 implementations of the same natural language spec will have thousands of observable differences.

Where we are today, with agents that require guardrails to keep from spinning out, there is no way to let agents work on code autonomously without all of those observable differences constantly shifting, resulting in unusable software.

Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.

The only solution to this problem is that LLMs get better. Personally I think at the point they can pull this off, they can do any white collar job, and there’s no point in planning for that future because it results in either Mad Max or Star Trek.

wtallis 144 days ago

> Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.

I don't think "complex" is the right word here. A test suite would generally be more verbose than the implementation, but a lot of the time it can simply be a long list of input->output pairs that are individually very comprehensible and easily reviewable to a human. The hard part is usually discovering what isn't covered by the test case, rather than validating the correctness of the test cases you do have.
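A minimal sketch of the "long list of input->output pairs" idea. The `slugify` function and its cases are hypothetical, chosen only for illustration. Each row can be reviewed on its own, but the table says nothing about inputs it does not list.

```python
import re

def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase, join alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

# Each row is individually easy to read and review: (input, expected output).
CASES = [
    ("Hello World", "hello-world"),
    ("  Trim  me  ", "trim-me"),
    ("AI-written code?", "ai-written-code"),
    ("", ""),
]

def failing_cases(fn, cases):
    """Return (input, expected, actual) for every row where fn disagrees with the table."""
    return [(x, want, fn(x)) for x, want in cases if fn(x) != want]

print(failing_cases(slugify, CASES))  # an empty list means every listed pair holds

# An input the table never mentions: the accented character is silently dropped.
print(slugify("Café"))
```

Validating the four rows is quick. Noticing that no row covers non-ASCII input is the harder part.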

skydhash 144 days ago

Code is like f(x)=ax+b. Your test would be a list of (x, y) tuples. You don’t verify the correctness of your points because they come from some source that you hold as true. What you want is the generic solution (the theory) proposed by the formula. And your test would be just a small set of points, mostly to ensure that no one has changed the a and b parameters. But if you have a finite number of points, the AI is more likely to give you a complicated spline formula than the simple formula above. Unless the tokens in the prompts push it to the right domain space. (Usually meaning that the problem is solved already.)

Real code has more dimensionality than the above example. Experts have the right keywords, but even then that’s a roll of the dice. And coming up with enough sample test cases is more arduous than writing the implementation.

Unless there’s no real solution (dimensionality is high), but we have a lot of test data with a lower dimensionality than the problem. This used to be called machine learning and we have metrics like accuracy for it.
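A small sketch of the f(x)=ax+b analogy, with a=2 and b=1 picked arbitrarily. The `overfit` function is not a spline. It is a simpler stand-in that makes the same point: it agrees with the intended formula on every test point and nowhere else.

```python
def linear(x):
    """The intended 'theory': f(x) = a*x + b, with a=2 and b=1 assumed for illustration."""
    return 2 * x + 1

# The finite test suite: a handful of (x, y) points taken from the trusted formula.
POINTS = [(x, linear(x)) for x in (0, 1, 2, 3)]

def overfit(x):
    """Matches linear() on every test point by adding a term that is zero only at those points."""
    bump = 1
    for px, _ in POINTS:
        bump *= (x - px)
    return linear(x) + bump

def passes(fn):
    """True when fn reproduces every (x, y) pair in the test suite."""
    return all(fn(x) == y for x, y in POINTS)

print(passes(linear), passes(overfit))  # True True
print(linear(4), overfit(4))            # 9 33
```

Both functions pass the suite, yet they disagree at x=4, a point the tests never checked.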

wizzwizz4 144 days ago

If some of those input-output pairs are the result of a different interpretation of the spec to other input-output pairs, it's possible that no program satisfies all the tests (or, worse, that a program that satisfies all the tests isn't correct).
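A sketch of how that can happen, assuming a spec that only says "round to the nearest integer". The two rounding functions and the test rows are hypothetical. One row was written with round-half-up in mind and another with round-half-to-even.

```python
import math

def round_half_up(x):
    """One reading of the spec: ties go upward."""
    return math.floor(x + 0.5)

def round_half_even(x):
    """Another reading: ties go to the nearest even integer (Python's built-in round)."""
    return round(x)

# Rows written by people holding different interpretations of the same spec.
TESTS = [
    (0.5, 1),  # assumes half-up
    (2.5, 2),  # assumes half-to-even
    (1.2, 1),  # uncontroversial
]

def failures(fn):
    """Return the (input, expected) rows that fn does not satisfy."""
    return [(x, want) for x, want in TESTS if fn(x) != want]

print(failures(round_half_up))    # [(2.5, 2)]
print(failures(round_half_even))  # [(0.5, 1)]
```

Neither reasonable interpretation passes the whole table. A function that special-cases the listed inputs would pass all three rows without implementing either rule, which is the worse outcome described above.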

sarchertech 144 days ago

At some point verbosity becomes complexity. If you’re talking all observable behavior the input and output pairs are likely to be quite verbose/complex.

Imagine testing a game where the inputs are the possible states of game, and the possible control inputs, and the outputs are the states that could result.

Of course very few human written programs require this level of testing, but if you are trying to prevent a swarm of agents from changing observable behavior without human review, that’s what you’d need.

Even with simpler input output pairs, an AI tells you it added a feature and had to change 2,000 input/output pairs to do so. How do you verify that those were necessary to change, and how do you verify that you actually have enough cases to prevent the AI from doing something dumb?

Oops you didn’t have a test that said that items shouldn’t turn completely transparent when you drag them.

logicchains 144 days ago

>For a non trivial program, 2 implementations of the same natural language spec will have thousands of observable differences.

If they're not defined in the spec then these differences shouldn't matter, they're just implementation details. And if they do matter, then they should be included in the spec; a natural language spec that doesn't specify some things that should be specified is not a good spec.

halfcat 144 days ago

> we just need to make the spec perfect

So, never.

Greg Kroah-Hartman was once asked by his boss, “When will Linux be done?” and he said, “When people stop making new hardware.” Even today, when we assume the hardware won’t lie, much of the work in maintaining Linux is around hardware bugs.

So even at the lowest levels of software development, you can’t know the bugs you’re going to have until you partially solve the problem and find out that this combination of hardware and drivers produces an error, and you only find that out because someone with that combination tried it. There is no way to prevent that by “make better spec”.

But that’s always been true. Basically it’s the 3-body-problem. On the spectrum of simple-complicated-complex, you can calculate the future state of a system if it’s simple, or “only complicated” (sometimes), but you literally cannot know the future state of complex systems without simulating them, running each step and finding out.

And it gets worse. Software ranges from simple to complicated to complex. But it exists within a complex hardware environment, and also within a complex business environment where people change and interest rates change and motives change from month to month.

There is no “correct spec”.

sarchertech 144 days ago

There are a limitless number of implementation details you don't think you care about until they are constantly changing.

I doubt there exists a single piece of nontrivial software today where you could randomly alter 5% of the implementation details while keeping to the spec, without resulting in a flood of support tickets.

Herring 144 days ago

Agreed, but with one exception: are tests supposed to cover all observable behavior? Usually people are happy with just eliminating large/easy classes of bad (unintended) behavior, otherwise they go for formal verification which is an entirely different ballgame.

sarchertech 144 days ago

No they aren’t because they can’t (at least not without becoming so complicated that there’s no longer a point).

But humans are much better at reasoning about whether a change is going to impact observable behavior than current LLMs are as evidenced by the fact that LLMs require a test suite or something similar to build a working app longer than a few thousand lines.

gopalv 144 days ago

> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen

Every strategy which worked with an off-shore team in India works well for AI.

Sometime in mid 2017, I found myself running out of hours in the day stopping code from being merged.

On one hand, I needed to stamp the PRs because I was an ASF PMC member and not a lot of the folks who were opening JIRAs were & this wasn't a tech debt friendly culture, because someone from LinkedIn or Netflix or EMR could say "Your PR is shit, why did you merge it?" and "Well, we had a release due in 6 days" is not an answer.

Claude has been a drop-in replacement for the same problem, where I have to exercise the exact same muscles, though a lot easier because I can tell the AI that "This is completely wrong, throw it away and start over" without involving Claude's manager in the conversation.

The manager conversations were warranted and I learned to be nicer two years into that experience [1], but it's a soft skill which I no longer use with AI.

Every single method which worked with a remote team in a different timezone works with AI for me & perhaps better, because they're all clones of the best available - specs, pre-commit verifiers, mandatory reviews by someone uncommitted on the deadline, ease of reproducing bugs outside production and less clever code over all.

[1] - https://notmysock.org/blog/2018/Nov/17/

mentalgear 144 days ago

> Every strategy which worked with an off-shore team in India works well for AI.

Why then hasn't SWE been completely outsourced over the last 20 years? Corporations were certainly trying hard.

arunabha 143 days ago

Cost. Claude Code is two orders of magnitude cheaper than an offshore dev.

mentalgear 143 days ago

We are talking 20-30 years back, when offshore was (and still is) cheaper.

ahsisjb 144 days ago

> Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO

Replace AI written with “cheap dev written” and think about why that isn’t already true.

The bottleneck is a competent dev understanding a project. Always has been.

Another fundamental flaw is you can’t trust LLMs. It’s fundamentally impossible compared to the way you trust a human. Humans make mistakes. LLMs do not. Anything “wrong” they do is them working exactly as designed.

NewsaHackO 144 days ago

>Humans make mistakes. LLMs do not. Anything “wrong” they do is them working exactly as designed.

This requires a redefinition of the term mistake, no?

zer00eyz 144 days ago

> Include the spec for the change in your PR

We would have to get very good at these. It's completely antithetical to the agile idea where we convey tasks via pantomime and Post-it rather than formal requirements. I won't even get started on the lack of inline documentation and its ongoing disappearance.

> Lean harder on your deterministic verification: unit tests, full stack tests,

Unit tests are so very limited. Effective but not the panacea that the industry thought it was going to be. The conversation about simulation and emulation needs to happen, and it has barely started.

> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen.

Most people who write software are really bad at reading others' code and at doing systems-level thinking. This starts at hiring: the LeetCode interview has stocked our industry with people who have never been vetted or measured on these skills.

> But anyone who's able to ship AI code without human review

Imagine we made everyone go back to the office, and then randomly put LSD in the coffee maker once a week. The hallucination problem is always going to be NON ZERO. If you are bundling the context in, you might not be able to limit it (short of using two models adversarially). That doesn't even deal with the "confidently wrong" issue... what's an LLM going to do with something like this: https://news.ycombinator.com/item?id=47252971 (random bit flips).

We haven't even talked about the human factors (bad product ideas, poor UI, etc.) that engineers push back against and an LLM likely won't.

That doesn't mean you're completely wrong: those who embrace AI as a power tool, and use it to build their app, and tooling that increases velocity (on useful features) are going to be the winners.

orsorna 144 days ago

>Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

It's wild that the gamut of PRs being zipped around don't even do these. You would run such validations as a human...

dwb 144 days ago

What is this obsession with specifications? For a start it’s certainly not fair to assume an LLM has translated it into correct code, even if there is one reasonable way to do so, and there probably isn’t. I like a good, well-targeted spec as much as anyone, but come on. A spec detailed enough to describe a program is more-or-less the program but written in a non-executable language. I want to review the code, not a spec.

xantronix 144 days ago

I can't help but think that the logical conclusion of spec-first development is a return to Waterfall methodology. The amount of rigour required almost entirely negates the speed advantages of LLMs, even in the hands of seasoned developers. Unless the stakeholders are external, there will always be that necessary organisational bottleneck; of course, the C-suite could always decide to foist project management entirely on individual contributors, or take it on themselves, but I see that ending either in burnout or eventual neglect. All in the service of being on the forefront of adoption, and for what end?

dwb 144 days ago

Yeah, agree. Either that or this idea of not reviewing the code at all takes hold, abdicating human engineering responsibility to the machines, until some big stupid disaster or when it’s Too Late.

whattheheckheck 144 days ago

It's the teleological fight. Do SWEs decide what the purpose of the system is, or do non-technical people?

Intention flows are important

dwb 144 days ago

I’m not talking about which humans decide the purpose of the system, or even which humans engineer the system once designed at a higher level. I’m worried about leaving crucial decisions and understanding to LLMs, with humans just stepping back.

elliemdaw 136 days ago

I think part of this gap is that the things we're verifying and the things we're reviewing are at different layers of abstraction. So when there's a ton more code, it takes way more mental load to review it all because engineers have to do this abstraction over a much higher volume of code. Treating the higher layer of abstraction as its own primitive that needs review isn't perfect but definitely helps... so each code diff also includes an architecture diff for example

gspr 144 days ago

> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen.

Why do you assume that's doable? I'm not saying it's not, but it seems strange to just take for granted that it is.

fishtoaster 144 days ago

Why do you assume I assume it's doable? :P

For real, I'm not certain we will ever be able to merge AI code without human review. But:

1. Every time I've confidently thought "AI will never be able to do X" in the last year, I've later been proven wrong, so I'm a bit wary to assume that again without strong reasons.

2. I see blog posts by some of the most AI-forward people that seem to imply some people are already managing large codebases without human review of raw code. Maybe they're full of crap - there are certainly plenty of over-credulous bs artists in the AI space - but maybe they're not.

3. The returns on figuring this out are so incredibly high that, if it's possible, people will figure it out.

All that to say: it's far from certain, but my bias is that it is possible.

gspr 143 days ago

> Why do you assume I assume it's doable? :P

Because you say we need to figure out techniques to do it. If it's not possible, then there are no techniques to do it. Since you want the techniques, I assume you assume that they exist.

> 1. Every time I've confidently thought "AI will never be able to do X" in the last year, I've later been proven wrong, so I'm a bit wary to assume that again without strong reasons.

That's evidence that you shouldn't assume something is impossible. I'm not suggesting that, either.

> 2. I see blog posts by some of the most AI-forward people that seem to imply some people are already managing large codebases without human review of raw code. Maybe they're full of crap - there are certainly plenty of over-credulous bs artists in the AI space - but maybe they're not.

Do you have any idea whether this works well though?

> 3. The returns on figuring this out are so incredibly high that, if it's possible, people will figure it out.

Ok. But again, that's a big if there.

The returns on breaking a popular cryptographic algorithm are also huge, but that's not an indication that it's possible, or that it's impossible for that matter.

I'm baffled why people think that "it would be great if..." has any bearing on the chances that the thing that follows is true.

wizzwizz4 144 days ago

1. Every time I've confidently stated "this AI architecture will never be able to do X" in the past 6 years, I've not been proven wrong (with one possible exception earlier today: https://news.ycombinator.com/item?id=47291893 – the jury's still out on that one). … No, my version doesn't really work, does it? It just sounds like bragging, or maybe hubris.

> some people are already managing large codebases without human review of raw code.

2. I have never believed this to be impossible. I do, however, maintain that these codebases are necessarily some combination of useless, plagiarism, and bloated. I have yet to see a case where there isn't a smaller, cheaper way to accomplish the same task faster and better.

> The returns on figuring this out are so incredibly high

3. And yet, they still haven't figured it out. My bias is that it isn't possible, because nothing has fundamentally changed about the model architectures since I first skimmed a PDF about GPT, and imagined an informal limiting proof that I still haven't found any holes in.

bigstrat2003 144 days ago

> Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO.

Or we could actually, you know, stop using a tool that doesn't work. People are so desperate to believe in the productivity boosts of AI that they are trying to contort the whole industry around a tool that is bad at its job, rather than going "yeah that tool sucks" and moving on like a sane person would.

whattheheckheck 144 days ago

Neural nets sucked in the 1960s, and if they had given up then, we wouldn't be here.

habinero 144 days ago

And the Concorde did not replace normal jet travel.

gjsman-1000 144 days ago

Do you know what happens to every industry when they get too fast and slapdash?

Regulation.

It happened with plumbing. Electricians. Civil engineers. Bridge construction. Haircutting. Emergency response. Legal work. Tech is perhaps the least regulated industry in the world. Cutting someone’s hair requires a license, operating a commercial kitchen requires a license, holding the SSN of 100K people does not yet.

If AI is fast and cheap, some big client will use it in a stupid manner. Tons of people can and will be hurt afterward. Regulation will follow. AI means we can either go faster, or focus on ironing out every last bug with the time saved, and politicians will focus on the latter instead of allowing a mortgage meltdown in the prime credit market. Everyone stays employed while the bar goes higher.

coffeefirst 144 days ago

He’s right. Exhibit A is age-gating social media. If the industry keeps being this careless that’s going to be the tip of the iceberg.

threatofrain 144 days ago

It's not just going to be software. We will absolutely be experiencing vibe law, vibe medicine, vibe legislation even. It'll be so much vibing that it's not worth saying the word anymore.

hackyhacky 144 days ago

> Regulation will follow.

I would hope so, but it won't happen as long as the billionaire AI bros keep on paying politicians for favorable treatment.

leptons 144 days ago

The word is "bribing", and the current (bribable) administration won't be around forever (hopefully).

disgruntledphd2 143 days ago

At the very least, the EU will regulate, and most other countries will copy. At that point the US will either need to regulate or watch its software exports go to zero, with consequent impacts on their stock markets.

I've been saying this for a while, but within a generation, software will be as regulated as finance.

px1999 144 days ago

Very well said.

I think that "deciding what types of code can be reliably handed off to AI" might be missing from the list. It's orders of magnitude easier to nail 80% all the time than 100% all the time. I could see standalone products even developing in this space.

pjm331 144 days ago

My bet is that the last item is what we’ll end up leaning heavily on - feels like the path of least resistance

Throw in some simulated user interactions in a staging environment with a bunch of agents acting like customers a la StrongDM so you can catch the bugs earlier

user3939382 144 days ago

I made a distributed operating system that manages all of this. Not just for agents per se but in general allows many devs to work simultaneously without tons of central review and allows them to keep standards high while working independently.