| HN Mirror


by K0balt 46 days ago

This is essentially true. There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided. The key is that yes, some of the design constraints will necessarily morph over time, since coding is as often about discovering the problem as solving it. But design principles don’t drift. If you have a design principle that cannot be adhered to, it is not a proper principle; it’s an opinion about the problem.

The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code. Notice that there are three separate things written about the code, plus the code itself. Keeping all of that correct, coherent, and consistent (with a separate, invariant document that describes your design principles) keeps the model from going off the rails and gives ample opportunity to sense bad smells before they get set in stone.
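As a rough illustration of what "drift" between code and docs can mean mechanically, here is a minimal sketch that compares the set of function names found in the code with the set mentioned in a manual. In the workflow described above it is the model that notices the drift; this helper, the `Drift` struct, and the function names in the example are all hypothetical and only show the simplest possible check.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Result of comparing what the code exposes with what the docs describe.
struct Drift {
    std::vector<std::string> undocumented;  // present in code, missing from docs
    std::vector<std::string> stale;         // present in docs, missing from code
};

// Hypothetical helper: both inputs are sets of function names, gathered by
// whatever means (not shown here). std::set keeps them sorted, which is what
// std::set_difference requires.
Drift find_drift(const std::set<std::string>& in_code,
                 const std::set<std::string>& in_docs) {
    Drift d;
    std::set_difference(in_code.begin(), in_code.end(),
                        in_docs.begin(), in_docs.end(),
                        std::back_inserter(d.undocumented));
    std::set_difference(in_docs.begin(), in_docs.end(),
                        in_code.begin(), in_code.end(),
                        std::back_inserter(d.stale));
    return d;
}

// Example with made-up names: "reset" was added to the code but never
// documented, and "calibrate" is still in the manual but no longer exists.
void example() {
    Drift d = find_drift({"init", "read", "reset"},
                         {"calibrate", "init", "read"});
    assert(d.undocumented == std::vector<std::string>{"reset"});
    assert(d.stale == std::vector<std::string>{"calibrate"});
}
```

A name-level check like this only catches the crudest kind of drift. Deciding whether the plan, the manual, the comments, or the code is the one that is wrong still takes judgment.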

It’s a token fire and you need a minimum 250k context model… but I still get as much work done in an hour as I used to do in a day, and the code I coauthor is better documented, more maintainable, and better tested than any code I have ever written before.


pron 45 days ago

> There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided.

Not at this time. Even if you could somehow get their success rate to 90%, it's still far too low because the mistakes can be (and occasionally are) catastrophic. It's only when you review everything that you find mistakes that will bite you down the line. If you don't review everything, you just don't know, but the rate of bad mistakes introduced by the agents is too high to trust, no matter how much prompting and orchestration you do. Maybe future models will address that, but we're not there yet.

> The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code.

That's helpful but it doesn't solve the problem, which is that the agents are happy to introduce horrendous workarounds, and they don't tell you that the code they've written is a horrendous workaround. The docs are fine and reflect the code and the code reflects the strategy, but you just don't know that the strategy is wrong.


K0balt 45 days ago

I haven’t had this problem. Maybe it’s because of the language I’m using (C++) or maybe it’s because of the strict enforcement of modularity and public vs private interfaces, etc that I use? Also, the code is tested against the hardware with every change. Idk if that’s why my experience has been different from yours or not.

My workflow also requires a discussion of the architecture and methodology of each addition or change, but honestly because we define the interfaces first, and each concern is given its own .c and .h file, it’s very hard to sneak something in without me noticing and calling it out. (Which does happen occasionally)
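A minimal sketch of what "interfaces first, one concern per file" might look like. The `led_driver` module, its functions, and the simulated register are all hypothetical and not taken from the thread; the two "files" are shown in one block only so the example compiles on its own.

```cpp
// ---- led_driver.h: hypothetical public interface, agreed on before coding ----
#include <cassert>
#include <cstdint>

namespace led_driver {
// Sets brightness as a percentage; values above 100 are clamped to 100.
void set_brightness(std::uint8_t percent);
// Returns the last duty-cycle value written to the (simulated) PWM register.
std::uint8_t current_duty();
}  // namespace led_driver

// ---- led_driver.cpp: private implementation, invisible to other modules ----
namespace {
// Internal linkage: nothing outside this translation unit can reach these,
// so any change that needs them has to go through the public interface.
std::uint8_t g_pwm_register = 0;  // stand-in for a real hardware register

std::uint8_t percent_to_duty(std::uint8_t percent) {
    assert(percent <= 100);  // callers in this file must clamp first
    return static_cast<std::uint8_t>((percent * 255u) / 100u);
}
}  // namespace

namespace led_driver {
void set_brightness(std::uint8_t percent) {
    if (percent > 100) percent = 100;
    g_pwm_register = percent_to_duty(percent);
}

std::uint8_t current_duty() { return g_pwm_register; }
}  // namespace led_driver
```

With a split like this, a reviewer mostly has to watch two things: changes to the small public interface, and anything that tries to reach around it.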

I suspect that file level granularity may be one of the keys. It never is actually working on more than a couple hundred lines of code at a time, plus interfaces of related files. I end up with a hundred files where I might have had 30 coding by hand, but it is actually easier to reason about the code for me as well, and the number of files is not an issue because of the automation. Total LOC is about the same as I would produce by hand for the same work, which means it’s actually writing less, due to the interface overhead, so I’m pretty stoked about that. The only real nightmare for humans is the long includes.

OTOH if I don’t do all of this it will definitely go off the rails and produce garbage.

I’ve been writing C (and C++) for almost 40 years, and although that doesn’t mean I’m any good, it does mean I have developed a keen sense of smell and highly sensitive olfactory PTSD.

With the right structured environment, a SOTA model with a suspicious seasoned dev holding its hand can be easier to manage and much more productive than a small team. Or, maybe I’ve just sucked so bad my whole life that I can’t tell the difference, but at any rate it works well enough to ship without nightmares, and with fewer bugs and less patching than I had before.

Edit:

I should mention that if bugs get tricky, like hardware idiosyncrasies and things like that, the model just goes nuts. But if I handle it very, very carefully so that it does not try to understand the problem, and I just have it poke the firmware with a stick from a distance enough times and from enough angles, it will usually nail it, as long as I have successfully prevented it from trying to figure out the problem (which is not as easy as it seems like it would be). If it starts to guess, it’s usually best just to roll back the context and start over with the poking. (I have a harness so it does direct hardware probes.)
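The "poke it from enough angles without theorizing" idea can be sketched as a loop that only records raw observations. This is an assumption about what such a harness might look like, not the harness mentioned above; `Probe`, `Observation`, `run_probes`, and the fake register are all hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// One raw observation: which probe ran, with what input, and what came back.
struct Observation {
    std::string probe;
    std::uint32_t input;
    std::uint32_t result;
};

// A probe is just a name plus a function that pokes the device.
struct Probe {
    std::string name;
    std::function<std::uint32_t(std::uint32_t)> poke;
};

// Runs every probe against every input, `repeats` times each, and records
// the raw results. There is deliberately no interpretation in here: the log
// is the only output, and conclusions are drawn later from the whole log.
std::vector<Observation> run_probes(const std::vector<Probe>& probes,
                                    const std::vector<std::uint32_t>& inputs,
                                    int repeats) {
    assert(repeats > 0);
    std::vector<Observation> log;
    for (const auto& p : probes)
        for (std::uint32_t in : inputs)
            for (int i = 0; i < repeats; ++i)
                log.push_back({p.name, in, p.poke(in)});
    return log;
}

// Hypothetical stand-in for quirky hardware: echoes the value back, but
// drops the lowest bit for inputs of 0x80 and above.
std::uint32_t fake_register_echo(std::uint32_t v) {
    return v >= 0x80 ? (v & ~1u) : v;
}
```

On real hardware the `poke` functions would wrap actual register reads or bus transactions. The point of the shape is that the quirk shows up in the log across many inputs before anyone commits to a theory about it.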

There seems to be an analog of this for non-hardware-related issues, but it’s harder to suss out when you should be telling it that you specifically do not want it to attempt to understand or solve the problem until you’ve rigged and tested all of the debug messaging.


pron 42 days ago

I don't think our experience is different. Letting the agent work on pieces no bigger than a couple hundred lines at a time and checking if there's something fishy or not and that the code is legible and logical is close human supervision. This is very much not what the people who wish AI could build products for them do or can do at the rate they're moving.


K0balt 40 days ago

Lol I guess you’ve got a point, but honestly it’s not more supervision than I would give a junior dev, at least until they had developed at least a few months’ track record of good judgement.

I guess the problem is the blind assumption of competence?

I just think of AI as being a lot like my late friend Henry. Henry had several PhDs, was an accomplished polymath in a bunch of other subjects, and spoke more than 20 languages with reasonable fluency. He was for sure one of the smartest people I ever met.

He was also prone to drinking, and when he was on a tear, you could barely tell except he would confidently say some of the most outrageous shit, or start speaking some other language without noticing. So you always took Henry with a grain of salt, and if it was important you’d double check. Even so, he was still an amazing resource to bounce things off of.
