
by chadd 35 days ago

We're working on a large Rust codebase, heavily assisted development with Claude and Codex, and one critical workflow is after you have written a spec, have the other LLM critique it thoroughly.

This back and forth will take quite a while, but the resulting implementation plan will be 10x better than the original.

You can automate this by giving Codex a goal, and a skill to call Claude to review the implementation spec until they both agree it's done.
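A minimal sketch of that loop, assuming two hypothetical callables, `revise` and `review`, that stand in for whatever actually invokes Codex and Claude (neither is a real API of those tools). The round cap is also an assumption, added because nothing guarantees the two will ever agree.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    approved: bool
    comments: str


def refine_spec(
    spec: str,
    revise: Callable[[str, str], str],  # hypothetical: (spec, comments) -> new spec
    review: Callable[[str], Review],    # hypothetical: spec -> reviewer verdict
    max_rounds: int = 5,
) -> tuple[str, bool]:
    """Alternate review and revision until the reviewer approves or rounds run out."""
    for _ in range(max_rounds):
        verdict = review(spec)
        if verdict.approved:
            return spec, True
        spec = revise(spec, verdict.comments)
    return spec, False
```

The second return value says whether the reviewer actually signed off, so a caller can tell "agreed" apart from "gave up after `max_rounds`".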

Then, for critical code, have them both implement the spec in a worktree, then BOTH critique each other's implementation.

More often than not, Claude will say to take 2 or 3 pieces from its design over to Codex, but ship the Codex implementation.

8 comments

Aurornis 34 days ago

I take this idea even further: After the LLMs have critiqued each other, I introduce a third critique and review it myself as a human. This third party review is most effective at highlighting problems that the LLMs miss, in my experience.

Jokes aside, I agree about having LLMs iterate. Bouncing between GPT and Opus is good in my experience, but even having the same LLM review its own output in a new session started fresh without context will surface a lot of problems.

This process takes a lot of tokens and a lot of time, which is fine because I’m reviewing and editing everything myself during that time.

knivets 34 days ago

This is astrology for devs.

keeganpoppen 34 days ago

as someone who is about as llm-forward as anyone out there, this is a brilliant analogy. was equally true of all the “prompt engineer” hype as well from a couple years ago (which i admit i still think does matter)… it kinda makes me feel like an audiophile / hi-fi person talking about how 24bit/192kHz is the one true encoding format and anything less is a willful (cynical, “Quality”-hating, satisficerist, etc.) compromise. which i freely admit to being one of those people as well.

and in both cases i both “know” that i can tell the difference and “know that i cannot tell the difference”. what anyone takes from that in terms of what it says about me, personally, is a bit of a Rorschach test, but Astrology is about as apt a description as there is… xD

kimixa 34 days ago

For higher than audible frequency sample rates there's a good chance you can tell the difference. It often causes weird aliasing and harmonics in the more audible frequencies on "real" playback equipment. You can train yourself to recognize some of these and often pretty accurately identify the higher sample rate examples. You might even mentally associate those signs with "Higher Quality".

But it's arguably less accurate to the original recording.

raincole 34 days ago

People thought asking an LLM to output the reasoning steps was astrology until it was standardized and made ubiquitous.

andai 33 days ago

Didn't multiple studies find the reasoning traces didn't have much to do with the final output? And even that outputting placeholder tokens during reasoning has a similar beneficial effect on benchmark scores?

(I don't think that's the full picture but, there's definitely something fishy going on there.)

tensegrist 33 days ago

reasoning itself just affords the model a ton of extra forward passes / "time to think"

the, como se dice, "misalignment" between the content of reasoning tokens and the actual output following the end of the reasoning is a separate problem, extensively studied by e.g. Anthropic

soloto 34 days ago

Do they have a golden calf to dance around? Without that success will be hit and miss.

keeganpoppen 34 days ago

i mean, maybe the golden calf people were right the whole time lol

Pay08 34 days ago

Right about what?

embedding-shape 34 days ago

Unless you can somehow provide some arguments against it, I feel like you're the one who is trying to cargo-cult stuff here.

Say what you will with proper reasoning or arguments if you feel compelled, tired reddit-commentary like that helps no one.

johnnyanmac 34 days ago

> Unless you can somehow provide some arguments against it,

We're year 4 into this discussion and camps have only gotten more bifurcated. There's no 1-1 discussion to have about this as of now, at least not before the crash.

Your only hope in such discourse is not trying to convince the other party how wrong they are, but appealing to an as of yet undecided party. Be it with reason, or simply pointing out how absurd some comments sound to the average person.

embedding-shape 34 days ago

> Your only hope in such discourse is not trying to convince the other party how wrong they are

I don't care about convincing anyone, the ones I reply to or others, but if you take the time to leave a comment, at least make it something to read and think about instead of soundbites like "This is astrology for devs", it's plain boring to read and makes HN worse.

johnnyanmac 34 days ago

>I don't care about convincing anyone

That's fine. Others will care for you.

>it's plain boring to read and makes HN worse.

I chuckled at the joke. Surprising amount of layers to it.

Though I never strove to be a comic nor writer, that kind of terse, compact punch makes me envy those of such literary talent.

embedding-shape 34 days ago

> I chuckled at the joke. Surprising amount of layers to it.

What joke?

keeganpoppen 34 days ago

i legitimately cannot divine what you are saying at all with this. there are so many dangling antecedents and modifiers that it is completely impossible. and i say this out of a genuine desire to understand what your argument is, knowing full well that i likely disagree with it.

embedding-shape 34 days ago

Alright, let me explain, hopefully simpler: GP told us about their experience working with LLMs, and gave some pointers to what they found to be working. The comment I replied to just says "This is astrology for devs" which basically is a cheap putdown without any reasoning or arguments for why the commenter believes so. My comment is urging them to actually participate in the discussion, not just post a soundbite they thought of in five seconds, so HN as a whole can remain good instead of devolving into reddit (which is a tale as old as HN, I know).

Hopefully it's understandable now, and hopefully you don't disagree :)

beepbooptheory 33 days ago

https://news.ycombinator.com/newsguidelines.html

> Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills

satvikpendem 33 days ago

Indeed, with the corollary of, please don't write Reddit-tier comments on HN either, then one wouldn't have to say it's turning into Reddit.

embedding-shape 33 days ago

Awesome, you did understand the reference I made, I was afraid I was too sneaky about it but seems it was just clear enough :)

munksbeer 34 days ago

You can't be serious. It couldn't be more obvious what the poster was referring to, a drive by put-down comment with no attempt to discuss anything seriously is more highly upvoted than an objection to such a comment.

What is this place for? Dang tells us, curious discussion. The guidelines explicitly state that certain comments are not in the spirit.

But the community seems to have decided otherwise, which is a shame.

embedding-shape 34 days ago

Don't read too much into it, downvotes/upvotes are highly random here, saying the same thing twice will have different reactions depending on the time of day and the topic of the submission, seems certain crowds are drawn to certain topics, which isn't that surprising.

I don't mind the downvotes, the points aren't really the reason I'm here anyways, I just want fun and interesting discussions with people and to read others' perspectives, the points don't hinder that :)

giancarlostoro 34 days ago

This is precisely how I used to use Beads before I made GuardRails (I wanted something slightly simpler, but similar with more 'guard rails'). I braindump everything I want to build, I ask Claude to do market level research. I then ask Claude to ask clarifying questions, then I ask Claude to be critical of its conclusions and provide the top options and to justify them. I also question Claude and say it's okay to disagree with me, be critical, I just want to understand.

By the end you have piecemeal "tickets" for your coding agent, if you have multiple developers you can sync them all up into github, and someone could take some locally, or you can just have Claude work on all of them with subagents. The key feature there is because it's all piecemeal the context stays per task.

Then I run a /loop 15m If you're currently working ignore this. Start on the next task in gur if you have not. If you finished all work and cannot pass one gate, work on the next available task.

(Note: gur is my shorthand for GuardRails)

I also added a concept called "gates" so a task cannot complete without an attached gate, gates are arbitrary, they can be reused but when assigned to a task those specific assignments are unique per task. A task is basically anything you want it to be: unit test, try building it, or even seek human confirmation. At least when I was using Beads it did not have "gates" but I'm not sure if it has added anything like it since I stopped using Beads.
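A rough sketch of the gate idea as described above. This is not GuardRails' actual data model; `Gate`, `Task`, and `try_complete` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Gate:
    name: str
    check: Callable[[], bool]  # e.g. run the tests, try a build, ask a human


@dataclass
class Task:
    title: str
    gates: list[Gate] = field(default_factory=list)
    done: bool = False

    def try_complete(self) -> bool:
        """Mark the task done only if it has at least one gate and all pass."""
        if not self.gates:
            return False  # a task cannot complete without an attached gate
        self.done = all(gate.check() for gate in self.gates)
        return self.done
```

Keeping the check as a plain callable is what lets a gate be "arbitrary": a unit test run, a build, or a prompt for human confirmation all fit the same shape.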

Claude will ignore the loop if it's currently working, and when it's "out of work" it will review all available tasks.

If anyone's curious it's MIT Licensed and on GitHub:

https://github.com/Giancarlos/guardrails

digitaltrees 34 days ago

I’ll check this out. I might integrate it into my IDE (www.propelcode.app) as a complement to plan mode.

keeganpoppen 34 days ago

oh man my body is ready for any post-beads ideas… i will definitely check this out

ai_fry_ur_brain 35 days ago

I hate how seriously people take the output of LLMs or how reliable they think it is.

Have Claude produce that spec 10 times, use the same prompt and same context. Identical requests, but you'll get 10 unique answers that will contradict each other, with each response seeming extremely confident.

It's scary how confident you people are in these outputs.

CrazyStat 35 days ago

If you ask 10 different humans to produce the spec with the same information (prompt and context) they will also produce 10 unique answers that will contradict each other and (depending on who you asked) may be just as confident.

There are real decisions to be made when going from a vague prompt to a spec. It's not surprising that an LLM would produce different specs for the same work on different runs. If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.

b40d-48b2-979e 35 days ago

LLMs aren't people. They don't reason. They're token generators, a black box. Your analogy falls on its face with any scrutiny.

CrazyStat 35 days ago

I didn’t claim that LLMs are people or that they reason.

If the behavior of the llm is the same as the behavior of reasonable people then the behavior of the llm is reasonable, regardless of how black of a box they generate tokens out of.

Reasonable people will generate divergent specs for the same prompt. Thus it is reasonable for an LLM to generate divergent specs out of the same prompt.

Edit: I use “reasonable” here in the legal sense of the “reasonable person” standard, not to imply any reasoning process.

digitaltrees 34 days ago

Aren’t people pattern matching neural networks as well? Why does being a token generator mean something is unreliable?

Further, why does that mean “it doesn’t reason”. Logic can be encoded in language, symbols or code. If I say “all apples are red” -> “all fruit in the bowl are apples” -> “therefore all the fruit are red”. It doesn’t really matter if I understand the logic or what red is or fruit/apples are, the logic is contained in the structure of the syntax. If an LLM can output the conclusion reliably from predictive operations it is able to have the effect of reason and we don’t need to know or care about whether it “understands” the reasoning.

keeganpoppen 34 days ago

no, brah, humans are TOTALLY different. just don’t think about it too hard. we are just special.

jatora 35 days ago

it's an analogy, it didn't fall on its face at all. it's just a comparison to highlight that the point being made was nonsensical. example: you're just a next action generator controlled by trillions of cells and subconscious dna-based behavior. a black box.

svieira 35 days ago

> you're just a next action generator controlled by trillions of cells and subconscious dna-based behavior.

With moral agency and the ability to learn (even if we presume you are correct, which I don't think you are).

jatora 34 days ago

moral agency and the ability to learn are implicit in the description you quoted. this isn't some special superpower, all animals have the ability to learn, and many have moral agency. these aren't human specific traits

b40d-48b2-979e 35 days ago

Reductio ad absurdum.

jatora 34 days ago

exactly my point lol

NobleLie 34 days ago

It appears they don't need to reason or be intelligent to be able to produce working solutions for code. But let them run wild and unmonitored? I wrangle my LLMs like the code monkeys they are. They help materialize code and then you need to sculpt it (with test harnesses of varying sorts)

It really can be useful. It's very different from old world programming.

keeganpoppen 34 days ago

why do people insist on claiming that they don’t reason, when they clearly, for all intents and purposes, do. you can be vague; you can express your idea a thousand different ways, and you will get a unique blend of <your input bits> x <hidden reasoning layer> => semi-smoothed output. this is like some Searle Chinese Room bullshit that needs to just die. it is beyond clear that llms can interact with abstract concepts in an extremely meaningful way. this is like the “thought leader” version of the stupid-ass “it’s just smart autocomplete” argument. if you think that, it is user error— either a failure of creativity or a failure of perception or both. just because llms are not a panacea and are problematic for society and “overhyped” and whatever does not make it intellectually honest to claim that there is zero reasoning/creativity/cognition within the box.

dnautics 35 days ago

LLMs do reason (they just sometimes don't reason well).

I assure you I've met many devs and "engineers" that reason less than LLMs, and are black boxes, especially in terms of the code they write.

claytongulick 34 days ago

> LLMs do reason

No, they don't.

They are token predictors that use statistical techniques to emit the randomly weighted next most likely token given the previous token list.

The result is a strange mimic of human reasoning, because the tokens it predicts are trained on strings that were produced by humans that were reasoning, but that's not the same thing.

Human cognition is complex and poorly understood, and the nature of the mind is an area of study almost as old as consciousness itself. We don't know exactly how it works, or what its exact relationship to the brain is, but we do know that it is not a simple token predictor.

LLMs, by their very nature are constrained to the concept of language and the relationship between existing words in a corpus. This is a box they can not escape.

Modern neuroscience suggests that the human brain is much more vast than that, and in many ways looks like it is constrained by language, but certainly not limited to it.

antonvs 34 days ago

> They are token predictors that use statistical techniques to emit the randomly weighted next most likely token given the previous token list.

Sounds like an implementation detail. Now describe how human reasoning works and explain why that process of chemical and electrical signals results in "reasoning" whereas what LLMs do isn't.

The problem with being this reductive is you can do it to anything, including humans. You can’t be reductive about LLMs and refuse to be reductive about humans - that's poor reasoning, and an LLM would out-reason you on this point, further negating your case.

dnautics 34 days ago

You have moved goalposts from reasoning to "human cognition". I won't tolerate that sort of slippery wordplay.

Reasoning is making analogies between logical patterns found in conceptual space, with a direction of time (statements precede conclusions). For example. A => B and B => C. You may now deduce A => C. For something fuzzier, A~D and B~E, you may now deduce that D~=>E. This is the sort of thing that higher layer attention mechanism is capable of doing.
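The first, non-fuzzy deduction above is ordinary transitivity of implication. As a sanity check it can be stated in Lean 4 (only the exact `A => B`, `B => C` case; the fuzzy `~` version has no formal statement here):

```lean
example (A B C : Prop) (hab : A → B) (hbc : B → C) : A → C :=
  fun a => hbc (hab a)
```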

> This is a box they can not escape.

Would you say that Helen Keller was less capable of abstract reasoning because she had more constrained access to sensory input?

digitaltrees 34 days ago

The structure of language encodes logic in many ways. So the model's ability to reason may be an emergent property of the reasoning ability humanity has ejected and extracted from our neural networks and abstracted into language and symbols.

keeganpoppen 34 days ago

there is absolutely no line of demarcation between human reasoning and what you described

IshKebab 34 days ago

Wow, there are still people trying to claim they don't reason. What will they have to do before you'll admit that they can?

esailija 34 days ago

You are asking the wrong question. It's not about if you can do X which can be faked especially if you are given practically infinite tries and all failures are hidden.

The people who want to believe they actually reason just ignore all obvious evidence to the contrary and cherry pick the times reasoning was faked well enough.

The people who don't want to believe will just take a second to understand how they work and then come up with ways to reveal they were faking all along. Like asking how many letters there are in a word lol.

It's only the people who don't want to believe that count because reality is what happens regardless of what you believe.

IshKebab 34 days ago

You seem to believe that something is only "reasoning" if it works in a particular way. That it's not enough for it to observationally display reasoning skills; it has to be using a particular method to do that so it's not "faking" it. Is that correct?

Jtarii 33 days ago

It will be interesting to see the excuses people come up with when LLMs inevitably start solving Millennium Prize Problems.

Jtarii 34 days ago

They very obviously reason.

dnautics 34 days ago

it's kind of crazy to think that the transformer architecture can't encode some primitive form of reasoning.

johnnyanmac 34 days ago

The issue is LLMs don't learn, despite the name. A human re-implementing a spec would strive to iterate towards what they feel is a better spec. They can take in their own input and self-correct. The work of implementing the spec gives insight into pain points and strengths, even if they never actually test the spec (they 100% should, but this is to emphasize that struggle for humans is in itself iteration, even before external feedback comes in).

An LLM isn't deterministic but also isn't iterative without an existing human. You give it the same spec 10 times and it produces 10 results that aren't far off from each other but vastly different when you go into the weeds. And not different in a way of improvement.

olafmol 35 days ago

An LLM should not "generate specs", a human should. The LLM can work from the specs. It can never infer meaning from a vague prompt. If so, it will start guessing. Every human that ever did functional specification or information analysis at some point knows this. Or has learned the hard way, something with assumptions and asses ;)

dist-epoch 34 days ago

The guessing of an LLM for a vague prompt is better than that of your average developer.

A prompt like "write these two files on disk" will very likely make the LLM do some sort of an atomic write/swap operation, unlike the average developer who will just write the two files and maybe later encounter a race condition bug. You can argue the LLM output is overkill, but it will also be more robust on average.
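For a single file, the usual "atomic write" pattern looks roughly like the sketch below: write to a temporary file in the same directory, then rename it over the target. `atomic_write` is an illustrative name, and this only makes each file's replacement atomic; it does not make a pair of files update together.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Write data to path so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens `path` concurrently gets either the complete old file or the complete new one, never a half-written mix.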

rixed 34 days ago

What kind of race condition do you have in mind?

skydhash 35 days ago

So what’s most important is knowing those parameters and the ranges of values, not having the final result. A human, after producing a spec, can then provide the mental model of how he created it: where the inflection points are and what the range of valid results is.

What has always mattered is how you decide the specs, not the specs in themselves.

claytongulick 34 days ago

> If you ask 10 different humans to produce the spec with the same information (prompt and context) they will also produce 10 unique answers

But they didn't ask humans, they asked a machine. We expect our machines to behave in predictable ways.

> If the prompt already contained answers to all the decision points that come up when writing the spec then the prompt would already be the spec itself.

This is one of the best arguments against using LLMs I've seen.

It reduces to the classic argument- at the point where you've described a problem and solution in sufficient detail to be confident in the results, you've invented a programming language.

CrazyStat 34 days ago

> We expect our machines to behave in predictable ways.

I expect LLMs to produce randomly varying output. Maybe it's the thousands of hours I spent doing monte carlo simulations for my PhD.

> This is one of the best arguments against using LLMs I've seen.

> It reduces to the classic argument- at the point where you've described a problem and solution in sufficient detail to be confident in the results, you've invented a programming language.

I'm not an LLM true believer, but I use codex for various small tasks and it often (not always) does a thoroughly decent job. Yesterday I gave it a pretty vague request to set up a new Home Assistant dashboard and it handled it just fine--I told it what I wanted to see but it figured out itself which helper variables it would need to set up to realize that vision and wrote all the config for it.

I probably could have done it in 15 minutes if I was familiar with Home Assistant's yaml configuration schema and all, but I'm not so it probably would have taken me closer to an hour. Asking codex took me 30 seconds and it did just fine.

I am skeptical that LLMs are going to kill all white collar jobs or whatever anytime soon. Not being able to truly learn things is an issue. Reality has a surprising amount of detail[1], and while codex does well at things like writing Home Assistant configs and setting up a Minecraft server, where there are thousands of examples online of how to do it, when I've asked it to do some more esoteric things it has sometimes failed spectacularly. I don't think having the LLM keep notes and then read them back (filling up the context window) is a real solution here.

[1] http://johnsalvatier.org/blog/2017/reality-has-a-surprising-...

claytongulick 34 days ago

I haven't made the argument that LLMs aren't useful, I can see cases where they are.

I don't think they include areas where correctness, determinism or human reasoning are important.

At least, not in isolation.

dxxvi 34 days ago

> It's not surprising that an LLM would produce different specs for the same work on different runs

This is what I don't understand: AI is a computer program with its own data. If we give the same input to that computer program every time, why does it produce different outputs every time? Or does the input include LLM data + our prompt + some random data that computer program picks from its Internet search?

CrazyStat 34 days ago

LLMs have a temperature parameter. At zero temperature they are deterministic: they always choose the most likely next token at each step based on what came before and the model weights, and they will always generate the same output given the same input.

As you raise the temperature they will start (pseudo)randomly choosing tokens other than the single most likely token (though that one will still be the most likely to be chosen). It turns out this is almost always better than zero temperature, which has a tendency to get caught in repetitive loops. I imagine all the frontier labs have spent thousands (millions?) of CPU hours tuning the temperature parameters on their models for optimal performance.
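A toy sketch of temperature sampling over made-up logits (pure Python, no real model; `sample_token` and the example numbers are illustrative assumptions). Temperature 0 is treated as plain argmax, which is how greedy decoding is usually described.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token from logits; temperature 0 means greedy (argmax)."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Scale logits by 1/T, then apply a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


logits = {"New York": 3.0, "Boston": 2.0, "Chicago": 1.0}
rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
warm = {sample_token(logits, 1.5, rng) for _ in range(1000)}
```

With these numbers `greedy` contains only "New York", while `warm` will almost certainly contain more than one distinct token.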

thesz 34 days ago

  > LLMs have a temperature parameter. At zero temperature they are deterministic: they always choose the most likely next token at each step based on what came before and the model weights, and they will always generate the same output given the same input.

https://en.wikipedia.org/wiki/Softmax_function

"A value proportional to the reciprocal of β is sometimes referred to as the temperature: β = 1/kT, where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating."

"Temperature" in the context of softmax does not change the "winning" token, it changes how probable (in the sense of the softmax distribution) the winning token will be. If the winning token is "New York", it will be the winner with temperature close to 0 and with temperature of 1e9.

The actual selection of the random token is done separately by using inputs outside of the softmax distribution, for example, by using a random number generator. I believe most LLM configs have a seed for the random number generator.
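Both points can be checked on toy numbers. The sketch below uses made-up logits and hypothetical helper names (`softmax`, `sample_run`), and says nothing about how any particular inference stack is implemented: the top-ranked token stays the same at every temperature, and a seeded generator makes the random draws repeatable.

```python
import math
import random


def softmax(logits: list[float], temperature: float) -> list[float]:
    """Softmax with temperature T > 0 (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 2.0, 1.0]  # made-up scores; index 0 is the "winning" token

for t in (0.01, 1.0, 1e9):
    probs = softmax(logits, t)
    # The ranking never changes; only how dominant the winner is.
    assert probs.index(max(probs)) == 0


def sample_run(seed: int, temperature: float, n: int = 20) -> list[int]:
    """Draw n token indices with a seeded RNG."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)
```

Calling `sample_run` twice with the same seed and temperature returns the same sequence, which is the sense in which a seed makes sampling reproducible in this toy setting.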

More than that, generation of code in most programming languages is done with more guardrails such as beam search guided by schema, syntax and semantics.

NobleLie 34 days ago

Nah. Even with zero temperature there is still variation.

digitaltrees 34 days ago

But those differences fall within a band of generally accepted results don’t they? And the cost to throw the code away and reimplement is low now. So maybe it doesn’t really matter if the implementation is perfect or identical.

That being said I agree people trust AI too much. Especially people with less experience. It’s easy to forget the models are mirrors of who we are as the drivers of the input context, not mentors that will guide us to best practices reliably.

Robdel12 35 days ago

Imagine making this your entire identity

motoboi 35 days ago

I strongly believe you don’t need to call another model for that. The same model can do the job fine. Just not as part of the same context.

I mean that if you ask codex on gpt 5.5 to submit to a plan reviewer subagent that uses gpt5.5, this is enough to have a very good reviewing and reassessment of the plan.

My hypothesis is that it’s even better than opus.

The reason submitting the product of one LLM to another for review works is that you need a fresh trajectory. The previous context might have “guided” the planner into some bias. Removing the context is enough to break free from that trajectory and start fresh.

DeathArrow 34 days ago

>We're working on a large Rust codebase, heavily assisted development with Claude and Codex, and one critical workflow is after you have written a spec, have the other LLM critique it thoroughly.

I do this with other languages, too, not just Rust. Thing is, you have to put a hard stop at some point because the models will always find something to nitpick.

slopinthebag 34 days ago

It's incredible how much developers will do to avoid having to look at or think about code.

lstodd 34 days ago

What is incredible is that these people have the gall to call themselves developers.

AnimalMuppet 35 days ago

The return of pair programming.