	by skeptrune 268 days ago
	It's always better for the agent to have fewer tools, and this approach means you get to avoid adding a "convert HTML to markdown" one, which improves efficiency. Also, I doubt most large-scale scrapers are running in agent loops with tool calls, so this is probably necessary for those at a minimum.


klodolph 268 days ago

This does not make any sense to me. Can you elaborate on this?

It seems “obvious” to me that if you have a tool which can request a web page, you can make it so that this tool extracts the main content from the page’s HTML. Maybe there is something I’m missing here that makes this more difficult for LLMs, because before we had LLMs, this was considered an easy problem. It is surprising to me that the addition of LLMs has made this previously easy, efficient solution somehow unviable or inefficient.

I think we should also assume here that the web site is designed to be scraped this way—if you don’t, then “Accept: text/markdown” won’t work.


hahnbee 267 days ago

If you have a website and you're optimizing it for GEO, you can't assume that the agents are going to have the glue. So as the person maintaining the website you implement as much of the glue as possible.


klodolph 267 days ago

That sounds completely backwards. It seems, again, obvious to me that it would be easier to add HTML->markdown converters to agents, given that there are orders of magnitude more websites out there than agents.

If your agent sucks so bad that it isn’t capable of consuming HTML without tokenizing the whole damn thing, wouldn’t you just use an agent that isn’t such a mess?

This whole thing kinda sounds crazy inefficient to me.


xg15 267 days ago

I don't think it's about including this as a tool, just as general preprocessing before the agent even gets the text.


skeptrune 267 days ago

Well that's what I implemented. There are markdown docs for every HTML file and the proxy decides to serve either markdown or HTML based on the Accept header.
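
For anyone wondering what the Accept-header decision could look like, here is a rough sketch in Python. It is not the actual proxy code: `parse_accept` and `choose_variant` are made-up names, and the rule (serve markdown only when the client explicitly lists `text/markdown` with a q-value at least as high as `text/html`, ignoring wildcards like `*/*`) is an assumption about a reasonable policy.

```python
def parse_accept(header: str) -> dict[str, float]:
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    prefs: dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_variant(accept_header: str) -> str:
    """Return "markdown" only if text/markdown is explicitly requested and
    ranked at least as high as text/html; otherwise return "html".

    Assumption: wildcards such as */* are ignored, so ordinary browsers
    keep getting HTML.
    """
    prefs = parse_accept(accept_header)
    md_q = prefs.get("text/markdown", 0.0)
    html_q = prefs.get("text/html", 0.0)
    return "markdown" if md_q > 0 and md_q >= html_q else "html"
```

A request sent with `Accept: text/markdown` would get the markdown file, while a typical browser header that only lists `text/html` and wildcards would fall through to HTML.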


xg15 267 days ago

I think GP meant on the client, i.e. agent side. As in, you could deploy this kind of proxy in a forward/non-reverse way inside the agent system, so the LLM always gets markdown, regardless of what the site supports.

There is no real reason to pass HTML with tags and all to the LLM - you can just strip the tags beforehand.
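
A minimal sketch of that agent-side preprocessing, using only Python's standard-library `html.parser`. The `TextExtractor` class and `strip_tags` helper are hypothetical names for illustration. This only strips tags and drops `<script>`/`<style>` contents; it does not produce markdown or try to pick out the main content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The fetch tool (or a forward proxy in front of it) would call `strip_tags` on the response body before anything reaches the LLM, so the model never sees the markup.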
