HN Mirror

by lmeyerov 37 days ago

Yes, the point is being comprehensive, so that early or blatantly cheap findings don't distract from the other ones. That's important for baseline results. Splitting by both file and task is (currently) important.

Additionally, we run in a loop until it stops finding things, and as part of that, do test amplification when it does find any. We regularly see 3-8 rounds yielding valid results.
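A minimal sketch of that loop, assuming a hypothetical `review(path, task)` callable that wraps one model call per (file, task) pair and a hypothetical `amplify_tests(finding)` hook. Neither name comes from the thread, and the fake reviewer at the bottom only exists to make the example runnable:

```python
from itertools import product
from typing import Callable, Iterable

# In this sketch a finding is just (file, task, message).
Finding = tuple[str, str, str]


def review_until_converged(
    files: Iterable[str],
    tasks: Iterable[str],
    review: Callable[[str, str], list[str]],   # hypothetical: one model call per unit
    amplify_tests: Callable[[Finding], None],  # hypothetical: add tests for a finding
    max_rounds: int = 10,                      # hard cap so the loop always terminates
) -> list[list[Finding]]:
    """Run review rounds until a round yields no new findings."""
    units = list(product(files, tasks))  # split by both file and task
    seen: set[Finding] = set()
    rounds: list[list[Finding]] = []
    for _ in range(max_rounds):
        new: list[Finding] = []
        for path, task in units:
            for message in review(path, task):
                finding = (path, task, message)
                if finding not in seen:
                    seen.add(finding)
                    new.append(finding)
        if not new:
            break  # converged: this round found nothing new
        for finding in new:
            amplify_tests(finding)  # test amplification only when something was found
        rounds.append(new)
    return rounds


# Stand-in reviewer that reveals one queued issue per call (illustration only).
pending = {
    ("a.py", "security"): ["unsanitized input", "hardcoded key"],
    ("b.py", "tests"): ["missing edge case"],
}


def fake_review(path: str, task: str) -> list[str]:
    queue = pending.get((path, task), [])
    return [queue.pop(0)] if queue else []


amplified: list[Finding] = []
rounds = review_until_converged(
    ["a.py", "b.py"], ["security", "tests"], fake_review, amplified.append
)
```

Each unit sees only one file and one task, so a loud finding in one unit cannot crowd out a quieter one elsewhere; the `max_rounds` cap is an added safety assumption, not something described above.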

IMO half the value is customization to your repo, so copying these and specializing them to your repo is super quick and pays off almost immediately: how to find style guides, how to run tests, what dimensions of correctness to look for, etc.

This kind of thing makes me question how important Mythos is for security bug finding: running a high-effort loop with a frontier model in code reviews until convergence has already outperformed human review for us. (It doesn't replace human review, but it does find things we miss, and it catches many of the things we would find, just earlier.)

esperent 37 days ago

How do you prevent it from increasing scope?

That's the main issue I've found from running loops like this. Each loop has ~7 agents, say, looking through different lenses (security, UX, performance, etc.). Each one notes a few issues, each issue gets fixed, and you do 5 to 8 loops, as you say. Each individual fix looks minor, but when you add it all up at the end, you've increased PR size and scope significantly.

adamthegoalie 37 days ago

That is such a good point.

I recently opened a PR against this AI personal finance tool Ray https://github.com/cdinnison/ray-finance/pull/8 to add an Apple Card import feature, since Apple Card is not supported by Plaid.

I built the manual import feature, opened the PR, and then ran a code review.

What I hadn't thought about when I built the feature was the myriad ways that importing data from Apple would have to be considered and integrated into the rest of the app for the manual import to be a first-class feature, not "just a manual import" of data.

I ended up running adamsreview against it like 5-10 times, before considering it complete, as I learned that there was much more to the integration than I realized.

Now, is that necessarily a problem? Maybe not. I should have realized from the start that the import feature was going to be much more than just a small feature. But at least, thanks to the review loop, I got it completely right before the PR was merged.

lmeyerov 37 days ago

Yep, a few views here:

- One wave is code reduction via DRY removals and architectural fixes, and another is adversarial, to get rid of false additions, so this helps with AI bloat either way.

- As the other comment says, underspecification is a problem, so this ends up finding where the implementation, tests, docs, quality guide, and spec are out of sync, whichever one is to blame.

- Usable, well-designed, secure, and well-typed code ends up being bigger, so this helps cut to the chase. Ultimately, either you get there or you don't, and this helps cut review burden so you can do your part of it faster and at a higher level.

Funnily enough, I'm now playing with gardening agents whose job is to reduce code. But I wouldn't want to slow PRs down for that, so I view those as separate PRs.

azurewraith 37 days ago

I've had similar experiences when I throw a bunch of agents at a problem: some things get flagged, but a lot gets truncated in the summarization step. Per-phase constraints solve this naturally, and I think the problem is better solved serially. Have each specialized 'review' phase scoped to only read and annotate (even better with code-owners-style read scoping), with max iterations enforced in deterministic code. Then the scope can't creep past the constraints you've set for it. Scope explosion comes from agents having unbounded tool access and no transition gates between phases: an agent will overreach if given the opportunity.
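A rough sketch of what serial, read-scoped phases with a deterministic iteration cap could look like. The `Phase` and `Annotation` types and the `annotate` callable are hypothetical names chosen for illustration, and the fake annotator stands in for a model call that tries to overreach:

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Sequence


@dataclass(frozen=True)
class Phase:
    name: str
    read_globs: tuple[str, ...]  # code-owners-style read scope
    max_iterations: int          # enforced by this code, not by the model


@dataclass(frozen=True)
class Annotation:
    phase: str
    path: str
    note: str


def run_phases(
    phases: Sequence[Phase],
    changed_files: Sequence[str],
    annotate: Callable[[Phase, list[str]], list[Annotation]],  # hypothetical model call
) -> list[Annotation]:
    """Run phases one after another; each may only annotate files in its scope."""
    notes: list[Annotation] = []
    seen: set[Annotation] = set()
    for phase in phases:
        in_scope = [
            path for path in changed_files
            if any(fnmatchcase(path, glob) for glob in phase.read_globs)
        ]
        for _ in range(phase.max_iterations):
            produced = annotate(phase, in_scope)
            # Transition gate: drop out-of-scope or repeated annotations.
            fresh = [a for a in produced if a.path in in_scope and a not in seen]
            if not fresh:
                break  # nothing new within scope, move to the next phase
            seen.update(fresh)
            notes.extend(fresh)
    return notes


# Stand-in annotator that always also tries a drive-by edit outside its scope.
def fake_annotate(phase: Phase, files: list[str]) -> list[Annotation]:
    in_scope = [Annotation(phase.name, f, f"{phase.name} note") for f in files]
    return in_scope + [Annotation(phase.name, "README.md", "drive-by rewrite")]


phases = [
    Phase("security", ("src/auth/*",), max_iterations=3),
    Phase("ux", ("src/ui/*",), max_iterations=2),
]
changed = ["src/auth/login.py", "src/ui/form.py", "README.md"]
notes = run_phases(phases, changed, fake_annotate)
```

The phases only return annotations and never write files, so the out-of-scope `README.md` note is discarded at the gate regardless of what the agent asked for.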
