Hacker News

by throwup238 450 days ago

An important Cursor feature that no one else seems to have implemented yet is documentation indexing. You give it a base URL and it crawls and generates embeddings for API documentation, guides, tutorials, specifications, RFCs, etc. in a language-agnostic way. An agent tool for fuzzy or full-text search over those same docs would also be nice. Referring to those @docs in the context works really well to ground the LLMs and eliminate API hallucinations.

Back in 2023, one of the Cursor devs mentioned [1] that they first convert the HTML to Markdown, then do n-gram deduplication to remove nav, headers, and footers. The state of the art for chunking has probably gotten a lot better since then, though.

[1] https://forum.cursor.com/t/how-does-docs-crawling-work/264/3
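
To make the n-gram deduplication idea concrete, here is a minimal sketch. It is not Cursor's actual implementation, which the forum post does not spell out. It assumes the pages are already converted to Markdown text, and it drops any line whose word 3-grams all appear on most pages of the site. The function names, the n-gram size, and the 0.8 page-ratio threshold are all hypothetical choices.

```python
from collections import Counter


def word_ngrams(line: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a line (the whole line if it is shorter than n)."""
    words = line.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def strip_repeated_lines(pages: list[str], n: int = 3,
                         min_page_ratio: float = 0.8) -> list[str]:
    """Drop lines made up entirely of n-grams that occur on most pages.

    Assumption: nav bars, headers, and footers repeat nearly verbatim
    across a docs site, while real content does not.
    """
    # Document frequency: on how many pages does each n-gram occur?
    doc_freq: Counter = Counter()
    for page in pages:
        seen: set[tuple[str, ...]] = set()
        for line in page.splitlines():
            seen |= word_ngrams(line, n)
        doc_freq.update(seen)

    threshold = min_page_ratio * len(pages)
    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            grams = word_ngrams(line, n)
            # A line is treated as boilerplate only if every one of its n-grams is site-wide.
            if grams and all(doc_freq[g] >= threshold for g in grams):
                continue
            kept.append(line)
        cleaned.append("\n".join(kept).strip())
    return cleaned


if __name__ == "__main__":
    # Made-up sample pages sharing a nav line and a footer line.
    sample = [
        "Home | Docs | Blog\n# Install\nRun the installer first.\n© Example Corp all rights reserved",
        "Home | Docs | Blog\n# Usage\nCall the client with a key.\n© Example Corp all rights reserved",
        "Home | Docs | Blog\n# FAQ\nCommon questions are answered here.\n© Example Corp all rights reserved",
    ]
    for page in strip_repeated_lines(sample):
        print(page, end="\n---\n")
```

A line-level filter like this is crude: it would also drop a legitimate sentence that happens to repeat on most pages, which is one reason a real pipeline would likely need something more careful.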

5 comments

mapmap 449 days ago

The continue.dev plugin for Visual Studio Code provides documentation indexing. You provide a base URL and a tag. The plugin then scrapes the documentation and builds a RAG index. This allows you to use the documentation as context within chat. For example, you could ask "@godotengine what is a sprite?"
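
As a rough illustration of the tag-plus-index idea, here is a toy sketch. It is not continue.dev's implementation: the `DocsIndex` class and its methods are hypothetical, the sample chunks are made up, and bag-of-words cosine similarity stands in for the learned embeddings a real RAG index would use.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class DocsIndex:
    """Hypothetical in-memory index: tag -> list of (chunk text, term vector)."""

    def __init__(self) -> None:
        self._chunks: dict[str, list[tuple[str, Counter]]] = {}

    def add(self, tag: str, chunks: list[str]) -> None:
        entries = self._chunks.setdefault(tag, [])
        for chunk in chunks:
            entries.append((chunk, Counter(tokenize(chunk))))

    def search(self, tag: str, query: str, k: int = 2) -> list[str]:
        """Return up to k chunks under the tag, best match first."""
        q = Counter(tokenize(query))
        scored = [(cosine(q, vec), text) for text, vec in self._chunks.get(tag, [])]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:k] if score > 0]


if __name__ == "__main__":
    index = DocsIndex()
    # Made-up chunks standing in for scraped documentation.
    index.add("godotengine", [
        "A sprite is a 2D image drawn on screen.",
        "Signals let nodes send messages to each other.",
        "A scene is a tree of nodes.",
    ])
    for hit in index.search("godotengine", "what is a sprite?"):
        print(hit)
```

The retrieved chunks would then be pasted into the chat prompt as context; the scraping and chunking steps are left out here.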

conartist6 449 days ago

So this is why everything is going behind Anubis then?

GreenWatermelon 449 days ago

Nah, Anubis combats systematic scraping of the web by data scrapers, not actual user agents.

conartist6 449 days ago

A scraper in this case is the agent of the user. That doesn't stop it from being a scraper that can and will get trapped.

lgiordano_notte 449 days ago

Cursor’s doc indexing is actually one of the few AI coding features that feels like it saves time. Embedding full doc sites, deduping nav/header junk, then letting me reference @docs inline really improves context grounding instead of leaving the model to guess APIs.

steveharman 450 days ago

Just use the Context7 MCP? Actually, I'm assuming Void supports MCP.

gesman 449 days ago

Context7 is missing lots of information from the repos it indexes, and it is getting bloated with similar-sounding repos, which is becoming confusing for LLMs.

Aeroi 450 days ago

Can you elaborate on how Context7 handles document indexing or web crawling? If I connect to the MCP server, will it be able to crawl websites fed to it?

andrewpareles 450 days ago

Agreed - this is one of the better solutions today.

andrewpareles 450 days ago

This is a good point. We've stayed away from documentation, assuming that it's more of a browser agent task, and I agree with other commenters that this would make a good MCP integration.

I wonder if the next round of models trained on tool-use will be good at looking at documentation. That might solve the problem completely, although OSS and offline models will need another solution. We're definitely open to trying things out here, and will likely add a browser-using docs scraper before exiting Beta.

RobinL 450 days ago

I agree that on the face of it this is extremely useful. I tried using it for multiple libraries, though, and it was a complete failure: it failed to crawl fairly standard MkDocs and Sphinx sites. I guess it's better for the 'built in' ones that they've pre-indexed.

throwup238 450 days ago

I use it mostly to index stuff like Rust docs on docs.rs and rendered mdbooks. The RAG is hit or miss but I haven’t had trouble getting things indexed.
