Hacker News
by latentsea 17 days ago
> Yes, I don't have anything important to say other than I 100% agree with this comment. AI in its current state is akin to Stack Overflow and Google on steroids, but from my experience, it doesn't do well building out full-scale applications other than perhaps some initial scaffolding.

We're currently using it to build out a full-scale application. It does as well as you care to coax into doing tbh. You have to invest heavily in harness engineering, and at least my experience has been that as you do that, the results improve.


>It does as well as you care to coax into doing tbh. You have to invest heavily in harness engineering, and at least my experience has been that as you do that, the results improve.

That is also my experience.

When starting a project, I observe how the agent fails, add new rules to the harness to prevent it from failing that way again, and repeat the process until I am happy with the output.
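A minimal sketch of that observe-fail-add-rule loop, assuming the "harness" is just a base prompt plus a growing list of rules. The `Harness` class and its method names are hypothetical and not taken from any particular tool:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Harness:
    """Hypothetical harness: a base prompt plus rules learned from failures."""
    base_prompt: str
    rules: List[str] = field(default_factory=list)

    def add_rule(self, rule: str) -> None:
        # Record a rule after observing a failure; skip exact duplicates.
        if rule not in self.rules:
            self.rules.append(rule)

    def render(self) -> str:
        # Build the instructions the agent sees on the next run.
        if not self.rules:
            return self.base_prompt
        bullet_lines = "\n".join(f"- {rule}" for rule in self.rules)
        return f"{self.base_prompt}\n\nRules:\n{bullet_lines}"
```

Each time the agent fails in a new way, you call `add_rule` with a rule that targets that failure and rerun with the output of `render()`.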

I'm unfamiliar with harness engineering. Is there any good documentation about the subject you could point me to?
https://openai.com/index/harness-engineering/

https://www.anthropic.com/engineering/harness-design-long-ru...

https://www.anthropic.com/engineering/effective-harnesses-fo...

These were some of the first major articles on it. It's becoming a popular topic, so there's more content on it all the time.

I can't point you to good, complete documentation, because the field is changing very fast as people make new discoveries.

I learned by reading articles, success stories, and failure stories, but mostly by doing: trying stuff, seeing how it works, adjusting it, and burning a lot of tokens along the way.

In your shoes, I would ask an AI chat to find new articles on the matter (including on HN) and to explain how Codex, Claude, and Pi manage agents.

My compressed view: you need a great specification, covering both the business side and the architecture, that doesn't leave anything important for the model to guess, because chances are it will make the wrong choices. That comprehensive spec should not be one huge chunk. Divide your plan into phases that each fit in a context window, and have a spec for each phase.

Use TDD and strive for 100% coverage. Force the model to behave: if it doesn't do what it is supposed to, give it feedback and ask it to retry, and don't allow it to progress to the next stage unless everything is perfect.

I also like to write comprehensive integration tests before building anything. The agents are not allowed to touch or read the integration tests; they can only run them, and they get feedback on where the tests fail. I like to write the integration tests in a different language than the software I am building, to make sure the tests don't rely on anything platform-specific. I use C#, Go, Rust and Zig for development and Python for the integration tests.
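The phase-gating part of the above could be sketched roughly like this. It is only an illustration: `Phase`, `run_phases`, and the `agent` and `check` callables are hypothetical names, and a real `check` would probably shell out to the test runner and return only the failure output, so the agent never sees the test sources.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str
    spec: str  # per-phase spec, sized to fit in one context window


# Hypothetical signatures (assumptions for this sketch):
# agent(spec, feedback) does one attempt at the phase;
# check(phase) returns None when everything passes, else a failure message.
Agent = Callable[[str, Optional[str]], None]
Check = Callable[[Phase], Optional[str]]


def run_phases(phases: List[Phase], agent: Agent, check: Check,
               max_retries: int = 3) -> List[str]:
    """Run phases in order; return the names of the phases that passed."""
    completed: List[str] = []
    for phase in phases:
        feedback: Optional[str] = None
        for _ in range(max_retries):
            agent(phase.spec, feedback)  # retry with the last failure as feedback
            feedback = check(phase)
            if feedback is None:
                completed.append(phase.name)
                break
        else:
            # Gate: the phase never passed, so later phases are not started.
            return completed
    return completed
```

The `for ... else` is the gate: if a phase uses up its retries without a clean check, the loop stops instead of moving on to the next phase.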

For now, to get good results, I can't just copy and paste the setup from one project to another; I have to work a lot to tailor the process for each new codebase.

And that's why I am working on an agent harness to try to force the agents to do the right things in the most common development scenarios without wasting many tokens. "Common development scenarios" is a large goal; right now I am working towards backend web development and microservices.

Sounds like bag pipes to me LOL
In my experience, you’ll eventually hit a context window issue and it will just start spouting gibberish/doing wrong things, and nothing will significantly improve it. But hey, maybe it’s improved.
Well, auto-compaction is a thing in Claude Code now. Plus we have the /goal command and some automated review stuff, so you can kinda just get it to loop until the automated reviews are satisfied and CI is passing. That does most of the heavy lifting.