|
|
|
|
|
by skeptrune
268 days ago
|
|
It's always better for the agent to have fewer tools and this approach means you get to avoid adding a "convert HTML to markdown" one which improves efficiency. Also, I doubt most large-scale scrapers are running in agent loops with tool calls, so this is probably necessary for those at a minimum. |
|
It seems “obvious” to me that if you have a tool which can request a web page, you can make it so that this tool extracts the main content from the page’s HTML. Maybe there is something I’m missing here that makes this more difficult for LLMs, because before we had LLMs, this was considered an easy problem. It is surprising to me that the addition of LLMs has made this previously easy, efficient solution somehow unviable or inefficient.
I think we should also assume here that the web site is designed to be scraped this way—if you don’t, then “Accept: text/markdown” won’t work.