Hacker News
by rfw300 111 days ago
Why should it be? The agent session is a messy intermediate output, not an artifact that should be part of the final product. If the "why" of a code change is important, have your agent write a commit message or a documentation file that is polished and intended for consumption.
7 comments

This reduces down to the problem of summarization - a quite difficult one. At commit time it’s difficult to know what questions readers will have. You can get close but never all the way there.

Pre-AI, when engineers couldn’t find the answer in commit messages or documentation, they would ask the author “why” and that human would “compute” the summary on demand.

I think that’s what I expect to do with these agent sessions - I don’t want more markdown, I want to ask it questions on demand. Git AI (https://github.com/git-ai-project/git-ai) uses the prompts that way. I think that model will win out. Save sessions. Read/ask questions relevant to the current agent’s work.

On asking peers. This is regrettably on the way out today - I’ll ask engineers about complex code they generated and they can’t give good answers. I think it’s because it all happened so fast — they didn’t sit with the problem for 48 hours. So even if they steered the agent thoughtfully it’s hard to remember all the decisions they made a week later.

It should be a distillation of the session and/or the prompts, at bare minimum. No, it should not include e.g. research-type questions, but it should include prompts that the user wrote after reading the answers to those research-type questions, and perhaps some distillation of the links / references surfaced during the research.

Prompts probably should be distilled / summarized, especially if they are research-based prompts, but code-gen prompts should probably be saved verbatim.

Reproducibility is a thing, and though perfect reproducibility isn't desirable, something needs to make up for the fact that vibe-coding is highly inscrutable and hard to review. Making the summary of the session too vague / distilled makes it hard to iterate and improve when / if some bad prompts / assumptions are not documented in any way.
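To make the verbatim-versus-distilled split above concrete, here is a minimal sketch of how a session record for a commit might be assembled. Everything in it is an assumption for illustration: the `Prompt` type, the `"research"` / `"codegen"` labels, and the `summarize` stand-in (a real setup would presumably use a proper summarizer rather than truncation).

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    kind: str  # hypothetical labels: "research" or "codegen"
    text: str


def summarize(text: str, limit: int = 80) -> str:
    """Stand-in summarizer: collapse whitespace, keep at most `limit` characters."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[: limit - 3] + "..."


def session_record(prompts: list[Prompt]) -> str:
    """Keep code-gen prompts verbatim; distill everything else."""
    lines = []
    for p in prompts:
        if p.kind == "codegen":
            lines.append("Prompt (verbatim): " + p.text)
        else:
            lines.append("Prompt (summary): " + summarize(p.text))
    return "\n".join(lines)
```

The resulting text could go in a commit message body or a file next to the change; where it lives is a separate question from what it contains.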

You have the source code though. That is the "reproducibility" bit you need. What extra reproducibility does having the prompts give you? Especially given that AI agents are non-deterministic in the first place. To me the idea that the prompts and sessions should be part of the commit history is akin to saying that the keystroke logs and commands issued to the IDE should be part of the commit history. Is it important to know that when the foo file was refactored the developer chose to do it by hand vs letting the IDE do it with an auto-refactor command vs just doing a simple find and replace? Maybe it is for code review purposes, but for "reproducibility" I don't think it is. You have the code that made build X and you have the code that made build X+1. As long as you can reliably recreate X and X+1 from what you have in the code, you have reproducibility.
> You have the source code though. That is the "reproducibility" bit you need.

I am talking about reproducing the (perhaps erroneous) logic or thinking or motivations in cases of bugs, not reproducing outputs perfectly. As you said, current LLMs are non-deterministic, so we can't have perfect reproducibility based on the prompts. But when trying to fix a bug, having the basic prompts lets us see whether we run into similar issues given a bad prompt. This gives us information about whether the bad / bugged code was just a random spasm, or something reflecting bad / missing logic in the prompt.

> Is it important to know that when the foo file was refactored the developer chose to do it by hand vs letting the IDE do it with an auto-refactor command vs just doing a simple find and replace? Maybe it is for code review purposes, but for "reproducibility" I don't think it is.

I am really using "reproducibility" more abstractly here, and don't mean perfect reproducibility of the same code. I.e. consider this situation: "A developer said AI wrote this code according to these specs and prompt, which, according to all reviewers, shouldn't produce the errors and bad code we are seeing. Let's see if we can indeed reproduce similar code given their specs and prompt". The less evidence we have of the specifics of a session, the less reproducible their generated code is, in this sense.

You are talking about documenting the intent of a piece of software if I understand correctly. But isn't that what READMEs and comments are for?
It's not reproducible though.

Even with the exact same prompt and model, you can get dramatically different results especially after a few iterations of the agent loop. Generally you can't even rely on those though: most tools don't let you pick the model snapshot and don't let you change the system prompt. You would have to make sure you have the exact same user config too. Once the model runs code, you aren't going to get the same outputs in most cases (there will be date times, logging timestamps, different host names and user names etc.)

I generally avoid even reading the LLM's own text (and I wish it produced less of it really) because it will often explain away bugs convincingly and I don't want my review to be biased. (This isn't LLM specific though -- humans also do this and I try to review code without talking to the author whenever possible.)

> I am talking about reproducing the (perhaps erroneous) logic or thinking or motivations in cases of bugs

But "to what purpose" is where this all loses me. What do you gain from seeing what was said to the AI that generated the bug? To me it feels like these sorts of things will fall into 3 broad categories:

1) Underspecified design requirements

2) General design bugs arising from unconsidered edge cases

3) AI gone off the rails failures

For items in category 1, these are failures you already know how to diagnose with human developers. Your design docs should already be recorded and preserved as part of your development lifecycle, and you should be feeding those same human-readable design documents to the AI. The session output here seems irrelevant to me: you have the input and you have the output, and everything in between is not reproducible with an AI. At best, if you preserve the history you can possibly get a "why" answer out of it, in the same way that you might ask a dev "why did you interpret A to mean B". But you're preserving an awful lot of noise and useless data in the hopes that the AI dropped something in its output that shows you some place where your spec isn't specific or detailed enough, in a way that a simple human review of the spec wouldn't catch anyway once the bug is known.

For category 2, again this is no different from the human operator case and there's no value that I can see in confirming in the logs that the AI definitely didn't consider this edge case (or even did consider it and rejected it for some erroneous reason). AI models in the forms that folks are using them right now are not (yet? ever?) capable of learning from a post mortem discussion about something like that to improve their behavior going forward. And it's not even clear to me that even if they were, you would need the output of the session as opposed to just telling the robot "hey at line 354 in foo.bar you assumed that A would never be possible, but no place in the code before that point asserts it, so in the future you should always check for the possibility of A because our system can't guarantee it will never occur."

And as for category 3, since it's going off the rails, the only real thing to learn is whether you need a new model entirely or if it was a random fluke, but since you have the inputs used and you know they're "correct", I don't see what the session gives you here either. To validate whether you need a new model, it seems that just feeding your input again and seeing if you get a similar "off the rails" result is sufficient. And if you don't get another "off the rails" result, I sincerely doubt your model is going to be capable of adequately diagnosing its own internal state to sort out why you got that result 3 months ago.
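The "feed your input again and see" check for category 3 can be sketched as a small loop that reruns the same prompt and counts how often the failure recurs. This is a rough sketch, not a real harness: `run_agent` and `is_failure` are hypothetical callables you would supply yourself (one invokes whatever agent you use, the other decides whether an output counts as "off the rails").

```python
from typing import Callable


def recurrence_rate(
    run_agent: Callable[[str], str],    # hypothetical: runs the agent on a prompt
    is_failure: Callable[[str], bool],  # hypothetical: flags an "off the rails" output
    prompt: str,
    trials: int = 10,
) -> float:
    """Rerun the same prompt `trials` times and return the fraction of failing runs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    failures = sum(1 for _ in range(trials) if is_failure(run_agent(prompt)))
    return failures / trials
```

A rate near zero suggests a fluke; a high rate points at the model or the input. Because the agent is non-deterministic, a small number of trials only gives a rough estimate.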

The source code is whatever is easiest for a human to understand. Committing AI-generated code without the prompts is like committing compiler-generated machine code.
> It should be a distillation of the session and/or the prompts, at bare minimum.

Huh, I thought that's what a commit message is for.

I mean, sure, a good, detailed commit message is perfectly fine to me in place of the prompts / a session distillation. But I am not holding my breath for vibe-coders to properly review their code and make such a commit message. But if they do, great! No need for prompt / session details.
In my case I have set things up so that the agent is the repo. The repo texts compose the agent’s memory. Changes to the repo require the agent to approve.

Repos also message each other and coordinate plans and changes with each other and make feature requests which the repo agent then manages.

So I keep the agents’ semantically compressed memories as part of the repo as well as the original transcripts because often they lose coherence and reviewing every user submitted prompt realigns the specs and stories and requirements.

Completely agree. Until recently I only let LLMs write my commit messages, but I've found that versioning the plan files is the better artifact: it preserves agentic decisions and my own reasoning without the noise.

My current workflow: write a detailed plan first, then run a standard implement -> review loop where the agent updates the plan as errors surface. The final plan doc becomes something genuinely useful for future iterations, not just a transcript of how we got there.

post mortems / bug hunting -- pinpointing what part of the logic was to blame for a certain problem.
this is what granular commits are for, the kilobytes long log of claude running in circles over bullshit isn't going to help anyone
I think the parent comment is saying “why did the agent produce this bug, and why wasn’t it caught”, which is a separate problem from what granular commits solve, of finding the bug in the first place.
There is no "why." It will give reasons but they are bullshit too. Even with the prompt you may not get it to produce the bug more than once.

If you sell a coding agent, it makes sense to capture all that stuff because you have (hopefully) test harnesses where you can statistically tease out what prompt changes caused bugs. Most projects won't have those and anyway you don't control the whole context if you are using one of the popular CLIs.

If I have a session history or histories, I can mine them (and have!) to pinpoint where an agent did not implement what it was supposed to, or to understand who asked for a certain feature and why, etc. It complements commits: sessions are more like a court transcript of what was said / claimed (session), which you can then compare to what was actually done (commits).
Some of my sessions are over 1GB at this point. I just don't think this scales usefully or meaningfully. Those things should live as summarized artifacts within issue tracking IMHO
Then look at the code, the session will only confuse. To read an LLM's explanation is to anthropomorphize what will just be a probabilistic incident.
no, you look at the session to understand what the context was for the code change -- what did you _ask_ the llm to do? did it do it? where did a certain piece of logic go wrong? Session history has been immensely useful to me and it serves as important documentation of the entire flow of the project. I don't think people should look at session histories at all unless they need to.
but that takes more tokens and time. if you just save the raw log, you can always do that later if you want to consume it. plus, having the full log allows asking many different questions later.
How’s it any different than a diff log?
Better question: how is it in any way similar?
If you read the history of both, and assuming there are good comments and documentation, it shows you the reasoning that went into the decision-making.