by electroly 394 days ago
I'm not convinced--codebase indexing is still a killer feature in Cursor. I have tens of thousands of reference files stashed in the project directory to be indexed so that any time the model reaches out to the codebase search tool with a question, it finds a file with the answer. Lots of it is not code and has no AST representation; it's documentation. Without codebase indexing, it may entirely miss the context.

1. This argument seems flawed. Codebase search gives it a "foot in the door"; from that point it can read the rest of the file to get the remaining context. This is what Cursor does. It's the benefit of the agentic loop; no single tool call needs to provide the whole picture.

2. This argument is "because it's hard we shouldn't do it". Cursor does it. Just update the index when the code changes. Come on.

3. This argument is also "because it's hard we shouldn't do it". Cursor does it. The embeddings go in the cloud and the code is local. Enforced Privacy Mode exists. You can just actually implement these features rather than throwing your hands up and saying it's too hard.
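For what "just update the index when the code changes" could look like, here is a minimal sketch. It is not Cursor's actual implementation; `plan_reindex` and the digest map are hypothetical names, and it assumes the index remembers a content hash per file from the last run so that only changed files need re-embedding.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(root: Path, known: dict[str, str]):
    """Compare files under root with the digests recorded at the last index run.

    Returns (changed_or_new, deleted, current_digests). Only the first list
    needs re-embedding; the second list needs removing from the index.
    """
    current = {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changed = [name for name, digest in current.items() if known.get(name) != digest]
    deleted = [name for name in known if name not in current]
    return changed, deleted, current
```

Persist the returned digest map between runs and pass it back in as `known`; a file watcher or a git hook could trigger the call, but that part is left out here.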

This honestly makes me think less of Cline. They're wrong about this and it seems like they're trying to do damage control because they're missing a major feature.

3 comments

Sounds like you're the exception rather than the rule. I've never seen a project with tens of thousands of files' worth of accurate documentation. In any project with that much documentation, the majority of it is outdated and/or wrong. Some of it is still useful, but only in a historical context.

The code is the authoritative reference.

Sure, I'm a power user with a monster codebase, never said that I wasn't. RAG is a power user feature.

The files are generated from external sources, pulling together as much information as I could collect. It's a script so I can keep it up to date. I think there is roughly no programmer out there who needs to be told that documentation needs to be up-to-date; this is obvious enough that I'm trying not to be offended by your strawman. You could have politely assumed, since I said I have it working, that it does actually work. I am doing productive work with this; it's not theoretical.

By documentation I assumed you meant internal documentation, like on a company Wiki.

External documentation is presumably already in the LLM's training data, so it should be extraneous to pull it into context. Obviously there's a huge difference between "should be" and "is"; otherwise you wouldn't be putting in the work to pull it into context.

I'd guess the breakdown is about:

- 80%: Information about databases. Schemas, sample rows, sample SQL usages (including buried inside string literals and obscured by ORMs), comments, hand-written docs. I collect everything I can find about each table/view/procedure and stick it in a file named after it.

- 10%: Swagger JSONs for internal APIs I have access to, plus sample responses.

- 10%: Public API documentation that it should know but doesn't.

The last 10% isn't nothing; I shouldn't have to do that and it's as you say. I've particularly had problems with Apple's documentation: a higher-than-expected hallucination rate in Swift when I don't provide the docs explicitly. Their docs require JavaScript (and don't work with Cursor's documentation indexing feature), which gives me a hunch about what might have happened. It was a pain in the neck for me to scrape it. I expect this part to go away as tooling gets better.

The first 90% I expect to be replaced by better MCP tools over time, which integrate vector indexing along with traditional indexing/exploration techniques. I've got one written to allow AI to interactively poke around the database, but I've found it's not as effective as the vector index.
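As a rough illustration of the "one file per table" approach described above, here is a small sketch. It is not the actual script: it assumes SQLite (via the standard-library `sqlite3` module) as a stand-in database, and `dump_table_docs` plus the Markdown-ish layout are made up for the example. It only covers schemas and sample rows, not the SQL usages or hand-written docs.

```python
import sqlite3
from pathlib import Path


def dump_table_docs(conn: sqlite3.Connection, out_dir: Path, sample_rows: int = 3) -> list[Path]:
    """Write one text file per table/view with its DDL and a few sample rows."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for name, ddl in tables:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (sample_rows,))
        columns = [col[0] for col in cursor.description]
        lines = [
            f"# {name}",
            "",
            "Schema:",
            ddl or "(no DDL recorded)",
            "",
            "Sample rows:",
            " | ".join(columns),
        ]
        lines += [" | ".join(str(value) for value in row) for row in cursor.fetchall()]
        path = out_dir / f"{name}.md"  # file named after the table
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Because the file name matches the table name, an exact-name search and a vector search both have a good chance of landing on the right file.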

A better argument against vector embeddings for AI code agents is that model performance degrades with number of tokens used (even when well below the context window limit), and vector chunks are more prone to bloating the context window with unhelpful noise than more targeted search techniques.
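To make the noise concern concrete, here is a toy sketch of one common mitigation: drop weak matches and cap how much retrieved text goes into the prompt. Everything in it is an assumption for illustration (the `select_chunks` name, the 0.5 cutoff, the whitespace-based token estimate); it is not how any of the tools mentioned here actually retrieve.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_chunks(query_vec, chunks, min_score=0.5, token_budget=200):
    """Pick chunk texts for the prompt from a list of (text, vector) pairs.

    Chunks are ranked by similarity to the query. Anything scoring below
    min_score is dropped, and chunks that would exceed token_budget are skipped.
    """
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    picked, used = [], 0
    for score, text in scored:
        if score < min_score:
            break  # everything after this point is an even weaker match
        cost = len(text.split())  # crude token estimate: whitespace-separated words
        if used + cost > token_budget:
            continue
        picked.append(text)
        used += cost
    return picked
```

A plain top-k retriever without the cutoff would return the weak matches too, which is the bloat being described.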

Claude Code doesn't do vector indexing, and neither does Zed. There aren't any rigorous studies comparing these tools, but you can find plenty of anecdotes of people preferring the output of Claude Code and/or Zed to Cursor's, and search technique is certainly a factor there!

Same with Augment. Indexing makes a huge difference in a large monorepo and I can't imagine working in an editor that doesn't support an LLM with full indexing.