| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by faangguyindia 7 days ago
	If you've codex what does it add over codex's default app? I am confused. Can't you simply ask codex in another tab to just do a code review?

6 comments

eranation 7 days ago

Developers should definitely use whatever tool they use to review the code they (or the tool) just wrote. We have a skill that does this in a loop - spin subagents, review (based on our coding standards), triage the review in another subagent, fix what's applicable, push back on what's not, and we run this in a loop. This is before you even open a PR.

The idea of a PR is for others to find things that you have a blind spot to, and also leave some paper trail on the thought process. E.g. if something was not fixed, there is a history of a comment and a reason on WHY it wasn't fixed. If you do all that only locally, that context is lost.

We noticed that even after doing this self review loop multiple times, we still find issues (either via other models / tools or via humans that have the "tribal knowledge")

Maybe one day AI will write perfect code and can review itself, but even if it's 0.1% chance it has a bug, or 1 in a million it will do something a bit sinister (like open a backdoor just in case you try to shut it down) - then I really think there is always going to be a need for humans to review something.

link

cheema33 7 days ago

> Can't you simply ask codex in another tab to just do a code review?

You are likely to get better results if you do not use the same model for review that wrote the code. I typically use Opus for code editing and GPT 5.5 for peer review using an automation with skills.

Training set is different between models. If there are gaps in coverage in one model, you want a different model reviewing the work. The second model will its own gaps, but the gap list is not identical.

link

sdevonoes 6 days ago

> You are likely to get better results if you do not use the same model for review that wrote the code

There’s no evidence of this. I guess you are anthropomorphising models (i.e., it’s good that - different human reviews your code)

link

embedding-shape 6 days ago

Yeah, one model over another seems to matter less, they respond differently to the same prompts, so if anything, I'd use multiple prompts over choosing one model over another.

However, using two models to generate two reviews easily beats doing one model and one review, as some models seem to "care" more about certain things, but you'll just miss different things if you change the model rather than add more.

link

tylermarques 6 days ago

There is some evidence.[1] The best reviewer is a different model with fresh context, worst is same model with same context.

1. https://arxiv.org/pdf/2603.04582

link

dominotw 6 days ago

well they are different. human or not. so it makes sense to get it reviewing by "something" different that one that wrote code.

link

krzyk 7 days ago

Results also depend on the prompt. You get different results if you ask to review the PR and focus on particular file than if you don't make it focus.

Or if you make it "be a security engineer" with particular focus points.

Or make it a grammar nazi, it will find way more typos than without such focus.

Of course all of those "focuses" needs to be in a separate context (agent/subagent) to make it work.

link

Art9681 7 days ago

I would suggest that you reverse those roles. gpt-5.5 as the implementer and Opus as the reviewer.

link

hombre_fatal 7 days ago

They find different things, and there's no reason to use one model for review. You want to review it until there's nothing left to be unearth.

And if you put the review effort into polishing an impl plan, then it doesn't matter which model implements it either.

link

pluralmonad 7 days ago

How come? I find Opus to have better taste and GPT to have more rigor.

link

pramodbiligiri 6 days ago

Mechanics of running their command aside, I think the main value add is all the rules: https://github.com/alibaba/open-code-review/tree/main/intern...

Like with "SKILL" files in general, it's got to do with Prompt Engineering: https://en.wikipedia.org/wiki/Prompt_engineering#Rationale

link

eyeris 7 days ago

Presumably nothing. Do note the publisher—Alibaba presumably would rather their own tools and models instead of licensing.

They do open source a fair bit of internal tooling, so it’s always interesting to see their approach

link

esafak 7 days ago

We'd need a benchmark to tell.

link

krzyk 7 days ago

It can be used outside of local machine.

We built something similar, it looks for new PRs where the bot is added and does reviews. Makes the code more tuned toward similar rules. I can't assume that a developer run a code review tool himself (just as I don't assume he/she run a build - so we run builds also).

It is just another perspective for code review, besides human. Unfortunately it uses a lot of tokens, and considering that Anthropic, OpenAI and Github Copilot all moved to token based pricing, it is quite a money burner.

link