

by wraptile 774 days ago

Parsing HTML is a solved and, frankly, not very interesting problem. Writing XPath/CSS selectors or JSON parsers (for when the data sits in script variables) is not much of a challenge for anyone.
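To illustrate the "data in script variables" case, here is a minimal sketch using only the Python standard library. The sample page, the variable name `window.__STATE__`, and the helper names `ScriptCollector` and `extract_script_json` are all made up for the example; real pages vary, and a selector library would usually handle the HTML side.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical page: the markup and the variable name are invented for this sketch.
SAMPLE_HTML = """
<html><body>
<h1 class="title">Example product</h1>
<script>window.__STATE__ = {"price": 19.99, "tags": ["a", "b"]};</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_script_json(html, var_name):
    """Return the JSON value assigned to var_name in any <script>, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    pattern = re.compile(re.escape(var_name) + r"\s*=\s*")
    decoder = json.JSONDecoder()
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            # raw_decode parses one JSON value and ignores trailing JS such as ";".
            value, _ = decoder.raw_decode(script, match.end())
            return value
    return None


state = extract_script_json(SAMPLE_HTML, "window.__STATE__")
print(state["price"], state["tags"])
```

This only works when the assigned value is strict JSON; a JavaScript object literal with unquoted keys or trailing commas would need a more forgiving parser.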

A more interesting problem is parsing data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control a web browser that is indistinguishable from a real user's and perform all the steps needed to retrieve the data as a full package. This is still a low-value proposition, though: the models would fumble the harder tasks, and the easier tasks can be done by a human in a couple of hours.

LLM use in web scraping is still purely educational and assistive, because the biggest problem in scraping is not the scraping itself but scaling scrapers and dealing with blocking, which is becoming extremely common.

1 comments

_el1s7 770 days ago

Exactly. Are you aware of any current efforts to do that?


wraptile 769 days ago

Not anything in open source yet.
