| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by lmeyerov 166 days ago
	Something I would add is planning. A big "aha" for effective use of these tools is realizing they run on dynamic TODO lists. Ex: Plan mode is basically bootstrapping how that TODO list gets seeded and how todos ground themselves when they get reached, and user interactions are how you realign the todo lists. The todolist is subtle but was a big shift in coding tools, and many seem to be surprised when we discuss it -- most seem to focus on whether to use plan mode or not, but todo lists will still be active. I ran a fun experiment last month on how well claude code solves CTFs, and disabling the TodoList tool and planning is 1-2 grade jumps: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . Fwiw, I found it funny how the article stuffs "smarter context management" into a breeze-y TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types believing they can do a good job of this and then, when forced to have results/evals that don't suck in production, losing the rest of the year on that step. (And even worse when they decide to fine-tune too.)

11 comments

btown 166 days ago

I'm unsure of its accuracy/provenance/outdatedness, but this purportedly extracted system prompt for Claude Code provides a lot more detail about TODO iteration and how powerful it can be:

https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...

https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...

I find it fascinating that while in theory one could just append these as reasoning tokens to the context, and trust the attention algorithm to find the most recent TODO list and attend actively to it... in practice, creating explicit tools that essentially do a single-key storage are far more effective and predictable. It makes me wonder how much other low-hanging fruit there is with tool creation for storing language that requires emphasis and structure.

lmeyerov 166 days ago

I find in coding + investigating there's a lot of mileage to being fancier on the todo list. Eg, we make sure timestamps, branches, outcomes, etc are represented. It's impressive how far they get with so little!

For coding, I actually fully take over the todo list in codex + claude: https://github.com/graphistry/pygraphistry/blob/master/ai/pr...

In Louie.ai, for investigations, we're experimenting with enabling more control of it, so you can go with the grain, vs that kind of wholecloth replacement

btown 166 days ago

Ooh, am I reading correctly that you're using the filesystem as the storage for a "living system prompt" that also includes a living TODO list? That's pretty cool!

And on a separate note - it looks like you're making a system for dealing with graph data at scale? Are you using LLMs primarily to generate code for new visualizations, or also to reason directly about each graph in question? To tie it all together, I've long been curious whether tools can adequately translate things from "graph space" to "language space" in the context of agentic loops. There seems to be tremendous opportunity in representing e.g. physical spaces as graphs, and if LLMs can "imagine" what would happen if they interacted with them in structured ways, that might go a long way towards autonomous systems that can handle truly novel environments.

lmeyerov 166 days ago

yep! So all repos get a (.gitignore'd) folder of `plans/<task>/plan.md` work histories . That ends up being quite helpful in practice: calculating billable hours of work, forking/auditing/retrying, easier replanning, etc. At the same time, I rather be with-the-grain of the agentic coder's native systems for plans + todos, eg, alignment with the models & prompts. We've been doing this way b/c we find the native to be weaker than what these achieve, and to hard to add these kind of things to them.

RE:Other note, yes, we have 2 basic goals:

1. Louie to make graphs / graphistry easier. Especially when connected to operational databases (splunk, kusto, elastic, big query, ...). V1 was generating graphistry viz & GFQL queries. We're now working on louie inside of graphistry, for more dynamic control of the visual analysis environment ("filter to X and color Y as Z"), and as you say, to go straight to the answer too ("what's going on with account/topic X"). We spent years trying to bring jupyter notebooks etc to operational teams as a way to get graph insights to their various data, and while good for a few "data 1%'ers", too hard for most, and Louie has been a chance to rethink that.

2. Louie has been seeing wider market interest beyond graph, basically "AI that investigates" across those operational DBs (& live systems). You can think of it as vibe coding is code-oriented, while louie is vibe investigating that is more data-oriented. Ex: Native plans don't think in unit tests but cross-validation, and instead of grepping 1,000 files, we get back a dataframe of 1M query results and pass that between the agents for localized agentic retrieval on that vs rehammering db. The CCC talk gives a feel for this in the interactive setting.

jodleif 166 days ago

For humans org-mode is good at this

homarp 166 days ago

aren't the system prompt of Claude public in the doc at https://platform.claude.com/docs/en/release-notes/system-pro... ?

Stagnant 166 days ago

The system prompt of claude code changes constantly. I use this site to see what has changed between versions: https://cchistory.mariozechner.at/

It is a bit weird why anthropic doesn't make that available more openly. Depending on your preferences there is stuff in the default system prompt that you may want to change.

I personally have a list of phrases that I patch out from the system prompt after each update by running sed on cc's main.js

handfuloflight 166 days ago

What are those phrases? Why do you exclude them?

btown 166 days ago

This is for Claude Code, not just Claude.

what 166 days ago

From elsewhere in that prompt:

> Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.

When did they add this? Real shame because the abundance of emojis in a readme was a clear signal of slop.

rrvsh 166 days ago

I've had a LOT of success keeping a "working memory" file for CLI agents. Currently testing out Codex now, and what I'll do is spend ~10mins hashing out the spec and splitting it into a list of changes, then telling the agent to save those changes to a file and keep that file updated as it works through them. The crucial part here is to tell it to review the plan and modify it if needed after every change. This keeps the LLM doing what it does best (short term goals with limited context) while removing the need to constantly prompt it. Essentially I feel like it's an alternative to having subagents for the same or a similar result

tacone 166 days ago

I use a folder for each feature I add. The LLM is only allowed to output markdown file in the output subfolder (of course it doesn't always obey, but it still limits pollution in the main folder)

The folder will contain a plan file and a changelog. The LLM is asked to continously update the changelog.

When I open a new chat, I attach the folder and say: onboard yourself on this feature then get back to me.

This way, it has context on what has been done, the attempts it did (and perhaps failed), the current status and the chronological order of the changes (with the recent ones being usually considered more authoritative)

fastball 166 days ago

Planning mode actually creates whole markdown files, then wipes the context that was required to create that plan before starting work. Then it holds the plan at the system prompt level to ensure it remains top of mind (and survives unaltered during context compaction).

ramoz 166 days ago

I don’t think it wipes the context window.

matchagaucho 166 days ago

The TODO lists are also frequently re-inserted into the context HEAD to keep the LLM aware of past and next steps.

And in the event of context compression, the TODO serves as a compact representation of the session.

dboon 165 days ago

I’m a DIY (or, less generously and not altogether inaccurately, NIH) type who thinks he could do a good job of smarter context management. But, I have no particular reason to know better than anyone else. Tell me more. What have you seen? What kinds of approaches? Who’s working on it?

lmeyerov 165 days ago

I'm optimistic most people can, given the time and resources

In the CCC video, you may enjoy the section on how we are moving to eval-driven AI coding for how we more methodically improve agents. Even more so, the slides before on motivating why it gets harder to improve quality as you go on.

One big rub is it's one of those areas where people grossly misunderestimate what is needed for the quality goals they're likely targeting, and if a long-living artifact to be maintained, the on-going costs. It's similar to junior engineers or short-term contractors who never had to build production-grade software before and haven't had to live with their decisions: These are quite learnable engineering skills, and I've found it useful to burn your fingers before having confidence in the surprising weight of cost/benefit decisions. The more autonomy and expectations you are targeting for the agent, the more so.

shnpln 166 days ago

Oh yes, I commonly add something like "Use a very granular todo list for this task" at the end of my prompts. And sometimes I will say something like "as your last todo, go over everything you just did again and use a linter or other tools to verify your work is high quality"

hadlock 166 days ago

Right now I start chatting with a separate LLM about the issue, the best structure for maintainability and then best libraries for the job, edge and corner cases, how to handle those, and then have it spit out a prompt and a checklist. If it has a UI I'll draw something in paint and refine it before having the LLM describe it in detail and primary workflow etc. and tell it to format it for an agent to use. That will usually get me a functional system on the first try which can then be iterated on.

That's for complicated stuff. For throw-away stuff I don't need to maintain past 30 days like a script I'll just roll the dice and let it rip.

shnpln 166 days ago

Yeah, this is a good idea. I will have a Claude chat session and a Claude Code session open side by side too.

Like a manual sub agents approach. I try not to pollute the Claude code session context with meanderings to much. Do that in the chat and bring the condensed ideas over.

renjimen 166 days ago

If you have pre-commit hooks it should do this last bit automatically, and use your project settings

shnpln 166 days ago

Yes, I do. But it does not always use them when I change contexts. I just get in the habit of saying it. Belt and suspenders approach.

sathish316 165 days ago

It’s surprising how simple TodoWrite and TodoRead tools are in planning and making sure an Agent follows the plan.

This is supposed to be an emulator of Claude’s own TodoWrite and TodoRead, which does a full update of a todo.json for every task update. A nice use of composition of edit tool - https://github.com/joehaddad2000/claude-todo-emulator

sathish316 165 days ago

Complex planning and orchestration for Multi-step usecases or persistent Todo lists is achievable by spinning up your own tools that does something similar to this.

By extending Claude Todo emulator, It was possible to make the agent come up with Multi-step Hierarchical plans and follow it and track updates on it for usecases like Oncall Troubleshooting Runbooks.

PS: the above open source repo does not provide single task update as a tool, which is not hard to implement on your own

veselin 166 days ago

I run evals and the Todo tool doesn't help most of the time. Usually models on high thinking would maintain Todo/state in their thinking tokens. What Todo helps is for cases like Anthropic models to run more parallel tool calls. If there is a Todo list call, then some of the actions after are more efficient.

What you need to do is to match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.

lmeyerov 166 days ago

Curious what kinds of evals you focus on?

We're finding investigating to be same-but-different to coding. Probably the most close to ours that has a bigger evals community is AI SRE tasks.

Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists, and as the talk gives examples of, which kind of strategies to use with them.

veselin 165 days ago

I am taking for SWE bench style problems where Todo doesn't help, except for more parallelism.

lmeyerov 164 days ago

Was guessing that, coding tasks are a valuable but myopic lense :)

I'm guessing a self-updating plan there is sufficient. I'm not actually convinced today's current plan <> todolist flow makes sense - in the linked PLAN.md, it gets unified, and that's how we do ai coding. I don't have evals on this, but from a year of vibes coding/engineering, that's what we experientially reached across frontier coding models & tools. Nowadays we're mixing in evals too, but that's a more complicated story.

jcims 166 days ago

Mind if I ask what models you’re using for CTF? I got out of the game about ten years ago and have been recently thinking about doing my toes back in.

lmeyerov 166 days ago

Yep -- one fun experiment early in the video is showing sonnet 4.5 -> opus 4.5 gave a 20% lift

We do a bit of model-per-task, like most calls are sending targeted & limited context fetches into faster higher-tier models (frontier but no heavy reasoning tokens), and occasional larger data dumps (logs/dataframes) sent into faster-and-cheaper models. Commercially, we're steering folks right now more to openai / azure openai models, but that's not at all inherent. OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.

Some of the discussion earlyish in the talk and Q&A after is on making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots, you can get mileage. For more heavy production efforts, I typically do not recommend them for most teams at this time for quality, speed, practicality, and budget reasons if they have the option to go with frontier models. However, some bigger shops are doing it, and I'd be happy to chat how we're approaching quality/speed/cost there (and we're looking for partners on making this easier for everyone!)

jcims 166 days ago

Nice! Thank you!

I just did an experiment yesterday with Opus 4.5 just operating in agent mode in vscode copilot. Handed it a live STS session for AWS to see if it could help us troubleshoot an issue. It was pretty remarkable seeing it chop down the problem space and arrive at an accurate answer in just a few mins.

I'll definitely check out the video later. Thanks!

bdangubic 166 days ago

at the end of the year than you get “How to Code Claude Code in 200 Million Lines of Code” :)

redanddead 163 days ago

ironic how the peak of code is a TODO list app