Hacker News new | ask | show | jobs
by mike_hearn 487 days ago
Aider + AI generated maps and user guides for internal modules has worked well for me. Just today I did my own version of a script that uses Gemini 2 Flash (1M context window) to generate maps of each module in my codebase, i.e. a short one or two sentence description of what's in every file. Aider's repo maps don't work well for me, so I disable them, and I think this will work better.

I also have a scratchpad file that I tell the model it can update to reflect anything new it learns, so that gives it a crude form of memory as it works on the codebase. This does help it use internal utility APIs.

2 comments

LLMs forcing us to improve our documentation habits. Seriously though, many languages allow API doc generation out of comments. Maybe these docs can just be flattened into a file.
Yes sort of. This particular codebase is a mix of Java and Kotlin, and all my internal code is documented with proper Javadocs/KDocs already since years, just for myself and other people I work with. That's partly why Gemini can make such accurate maps.

The problem isn't a lack of docs but rather birds-eye context: even with models that allow huge context windows and are fast, you can drown a model in irrelevant stuff and it's expensive. I'm still with Claude 3.5 for coding and its window is large but not unlimited. You really don't want to add a bunch of source files _and_ the complete API docs for tens of thousands of classes into every prompt, not unless you like waiting whilst money burns and getting problems due to the model getting distracted.

It's also just wasteful, docs contain a lot of redundancy and stuff the model can guess. If you ask models to make notes about only the surprising stuff, it's a form of compression that lets you make smaller prompts.

Aider provides a quick fix because it's easy to control what files are in the context. But to 'level up' I need to let the AI find and add files itself. Aider can do this: it gives the model tools for requesting files to be added to the chat. And in theory, Aider computes a PageRank over symbols and symbolic references to find the most important stuff in the repository and computes a map that's prepended to the prompt so the model knows what to ask for. In practice for reasons I don't understand, Aider's repo maps in this project are full of random useless stuff. Maybe it works better for Python.

Finding the right way to digest codebases is still an open problem. I haven't tried RAG, for instance. If things are well abstracted it in theory shouldn't be needed.

Indexing automatically generated API docs using Cursor seems to work very well. I also index any guides/mdbooks libraries have available, depending on whether I’m trying to implement something new or modifying existing code.
> Aider + AI generated maps and user guides

How do do that? Especially the AI generated map?

I have a custom script. It selects all the source files, strips any license headers and concatenates them like this:

    <source_file name="foo/bar/Baz.java">
    ...
    </source_file>
It then chunks them to fit within model context window limits, sends it to the LLM with a system prompt that asks it to summarize each file in a compact way, and writes the result back out to the tree.

The ugly XML tag is to avoid conflicts. Some other scripts try to make a Markdown document of the tree which is silly because your tree is quite likely to contain Markdown already, and so it's confusing for the model to see ``` that doesn't really terminate the block. Using a marker pattern that's unlikely to occur in your code fixes that.