


	by klodolph 268 days ago
	I don’t understand why the agents requesting HTML can’t extract text from HTML themselves. You don’t have to feed the entire HTML document to your LLM. If that’s wasteful, why not have a little bit of glue that does some conversion?


simonw 268 days ago

Converting HTML into Markdown isn't particularly hard. Two methods I use:

1. The Jina reader API - https://jina.ai/reader/ - prefix any URL with r.jina.ai/ to run it through their hosted conversion proxy, e.g. https://r.jina.ai/www.skeptrune.com/posts/use-the-accept-hea...

2. Applying Readability.js and Turndown via Playwright. Here's a shell script that does that using my https://shot-scraper.datasette.io tool: https://gist.github.com/simonw/82e9c5da3f288a8cf83fb53b39bb4...
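For illustration, method 1 comes down to string concatenation plus an ordinary GET. A minimal Python sketch, assuming the reader accepts the target URL directly after the `https://r.jina.ai/` prefix; the helper names `reader_url` and `fetch_markdown` are hypothetical, and the response body is whatever the hosted service chooses to return:

```python
import urllib.request

# Prefix described in the comment above; everything after it is the target URL.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the proxy URL by prepending the reader prefix to the page URL."""
    return READER_PREFIX + page_url.strip()


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the converted page through the proxy (performs a network request)."""
    with urllib.request.urlopen(reader_url(page_url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_markdown("https://example.com/")` would return the service's rendering of that page as text; rate limits and authentication are not handled here.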


skeptrune 268 days ago

While simplifying Firecrawl[2], I learned that the Go CLI[1] is the best converter. In this case, though, I used one available on npm so that it would work with `npx` in the Cloudflare Worker builds.

[1]: https://github.com/JohannesKaufmann/html-to-markdown

[2]: https://github.com/devflowinc/firecrawl-simple


osener 268 days ago

Playwright starts a full browser instance. A lightweight alternative is to use an HTML parser and DOM implementation like linkedom.

This is much cheaper to run on a server. For example: https://github.com/ozanmakes/scrapedown


skeptrune 268 days ago

It's always better for the agent to have fewer tools, and this approach means you get to avoid adding a "convert HTML to markdown" tool, which improves efficiency.

Also, I doubt most large-scale scrapers are running in agent loops with tool calls, so this is probably necessary for those at a minimum.


klodolph 268 days ago

This does not make any sense to me. Can you elaborate on this?

It seems “obvious” to me that if you have a tool which can request a web page, you can make it so that this tool extracts the main content from the page’s HTML. Maybe there is something I’m missing here that makes this more difficult for LLMs, because before we had LLMs, this was considered an easy problem. It is surprising to me that the addition of LLMs has made this previously easy, efficient solution somehow unviable or inefficient.

I think we should also assume here that the web site is designed to be scraped this way—if you don’t, then “Accept: text/markdown” won’t work.
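To make the dependency on site support concrete, here is a rough Python sketch of the client side: the request asks for markdown first and HTML as a fallback, and the client then checks the response `Content-Type` to see whether the site actually honored it. The function names and the `q=0.8` weight are illustrative assumptions, not taken from any particular agent.

```python
import urllib.request


def build_markdown_request(url: str) -> urllib.request.Request:
    """Build a GET request that prefers markdown and accepts HTML as a fallback."""
    return urllib.request.Request(
        url,
        headers={"Accept": "text/markdown, text/html;q=0.8"},
    )


def is_markdown(content_type: str) -> bool:
    """True if a Content-Type value names text/markdown, ignoring parameters."""
    return content_type.split(";", 1)[0].strip().lower() == "text/markdown"
```

A site that ignores the header simply answers with `text/html`, in which case `is_markdown` returns False and the client still has to do its own conversion.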


hahnbee 268 days ago

If you have a website and you're optimizing it for GEO (generative engine optimization), you can't assume that the agents are going to have the glue. So as the person maintaining the website, you implement as much of the glue as possible.


klodolph 267 days ago

That sounds completely backwards. It seems, again, obvious to me that it would be easier to add HTML->markdown converters to agents, given that there are orders of magnitude more websites out there than agents.

If your agent sucks so bad that it isn’t capable of consuming HTML without tokenizing the whole damn thing, wouldn’t you just use an agent that isn’t such a mess?

This whole thing kinda sounds crazy inefficient to me.


xg15 268 days ago

I don't think it's about including this as a tool, just as general preprocessing before the agent even gets the text.


skeptrune 268 days ago

Well, that's what I implemented. There is a markdown doc for every HTML file, and the proxy decides whether to serve markdown or HTML based on the Accept header.
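The decision described here could look roughly like the following Python sketch. It is not the actual proxy code: the function names are hypothetical, and it assumes each markdown doc sits next to its HTML file with a `.md` suffix. Markdown is served only when the client names `text/markdown` explicitly with a weight at least as high as `text/html`, so a browser sending `*/*` still gets HTML.

```python
from pathlib import PurePosixPath


def parse_accept(header: str) -> dict[str, float]:
    """Parse an Accept header into a {media_type: q_value} mapping."""
    prefs: dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_representation(accept_header: str) -> str:
    """Return "markdown" only if text/markdown is listed and not outranked by text/html."""
    prefs = parse_accept(accept_header)
    md_q = prefs.get("text/markdown", 0.0)
    html_q = prefs.get("text/html", 0.0)
    return "markdown" if md_q > 0 and md_q >= html_q else "html"


def file_to_serve(html_path: str, accept_header: str) -> str:
    """Map an HTML path to its assumed sibling .md file when markdown is chosen."""
    if choose_representation(accept_header) == "markdown":
        return str(PurePosixPath(html_path).with_suffix(".md"))
    return html_path
```

With this logic, a request carrying `Accept: text/markdown` for `docs/index.html` would be answered from `docs/index.md`, while a typical browser header falls through to the HTML file.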


xg15 268 days ago

I think GP meant on the client, i.e. agent side. As in, you could deploy this kind of proxy in a forward/non-reverse way inside the agent system, so the LLM always gets markdown, regardless of what the site supports.

There is no real reason to pass HTML with tags and all to the LLM - you can just strip the tags beforehand.
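As a rough illustration of stripping tags before the text reaches the LLM, here is a Python sketch using only the standard library's `html.parser`. It is deliberately cruder than Readability or Turndown: it keeps visible text, drops the contents of a few non-content elements, and discards all structure. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return " ".join(self._chunks)


def strip_tags(html: str) -> str:
    """Return the visible text of an HTML document with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Run inside the fetch step of an agent, something like `strip_tags(page_html)` would mean the model only ever sees plain text, whether or not the site supports markdown responses.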
