| HN Mirror



	by nodoodles 774 days ago
	What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so that the scraping itself can run at low cost and high performance.
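
A minimal sketch of what "rules mapped to output keys" could look like, using only the Python standard library. The rule format (`tag`, `class`, `process`), the `RULES` contents, and the `extract` helper are all hypothetical, invented for illustration. The point is that a rule set like this would be generated once (by an LLM or by hand) and then applied cheaply to every page, with no model call at scrape time.

```python
from html.parser import HTMLParser

# Hypothetical rule set: output key -> where to look and how to post-process.
# A real tool would probably emit full CSS selectors; tag + class keeps this stdlib-only.
RULES = {
    "title": {"tag": "h1", "class": "product-title", "process": str.strip},
    "price": {"tag": "span", "class": "price",
              "process": lambda s: float(s.strip().lstrip("$"))},
}

class RuleExtractor(HTMLParser):
    """Collects the text of the first element matching each rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.result = {}
        self._active = None  # output key currently being captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            return  # already capturing; nested tags just contribute text
        classes = (dict(attrs).get("class") or "").split()
        for key, rule in self.rules.items():
            if key not in self.result and tag == rule["tag"] and rule["class"] in classes:
                self._active = key
                self._buf = []
                break

    def handle_data(self, data):
        if self._active is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        # Simplification: closes on the first matching end tag, so a rule's
        # element must not contain a nested element with the same tag name.
        if self._active is not None and tag == self.rules[self._active]["tag"]:
            rule = self.rules[self._active]
            self.result[self._active] = rule["process"]("".join(self._buf))
            self._active = None

def extract(html, rules=RULES):
    parser = RuleExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.result

sample = '<div><h1 class="product-title"> Widget </h1><span class="price sale">$19.99</span></div>'
print(extract(sample))  # {'title': 'Widget', 'price': 19.99}
```

Keeping the rules as plain data is what makes them cheap to regenerate when a site changes its markup: only the rule set needs updating, not the extraction code.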

8 comments

jumploops 774 days ago

Agreed!

Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).

We currently use this at Magic Loops[2] and it works _most_ of the time.

The long tail is difficult though, and it's not uncommon for users to fall back to raw HTML and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).

Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.

[0] https://apify.com/apify/website-content-crawler

[1] https://github.com/extractus/article-extractor

[2] https://magicloops.dev/

[3] https://reworkd.ai/

KhoomeiK 774 days ago

This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.

nodoodles 773 days ago

Awesome to hear! Looking forward to a launch -- the waitlist form was too long to complete, I'd need another LLM to fill it in :)

KhoomeiK 773 days ago

1 month away ;)

spxneo 774 days ago

All-around automation sucks with an LLM thrown on top of it.

the statistics are not in its favour

visarga 773 days ago

Code is also hard. You have to generate code that accounts for all possible exceptions or errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known in order to write the code.

KhoomeiK 774 days ago

Yep, until you generate code—it's harder from a technical POV but you can get way higher performance & reliability.

longgui0318 774 days ago

Here's a project that uses an LLM to generate crawling rules and then scrape with them, but it looks like it's still in the early stages of research.

https://github.com/EZ-hwh/AutoCrawler

nodoodles 773 days ago

Thanks, will look into it, looks promising

nikcub 774 days ago

Most of the top LLMs already do this very well. It's because they've been trained on web data, and also because they're being used for precisely this task internally to grab data.

The complicated ops side of scraping is running headless browsers, managing IP ranges, bypassing bot detection, solving captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.

nodoodles 773 days ago

Agreed there are several complexities, but not sure which ‘this’ you mean - updating selectors specifically is one of the areas I had in mind earlier.

selimthegrim 774 days ago

There was one I remember out of UF/FSU called Intoli that seems to have pivoted into consulting.

greggsy 774 days ago

It also seems obvious that you'd want to simply drag a box around the content you want, and have the tool provide some examples to help you refine the rule set.

Ad blockers have had something very close to this for some time, without any sparkly AI buttons.

I’m sure someone is working on a subscription-based product using corporate models in the backend, but it’s something that could easily be implemented with a very small model.

uptown 774 days ago

Mozenda does something like that. I haven't used it in many years, so I'm not up to date on what it currently offers.

geuis 774 days ago

That's an interesting take. I've been experimenting with reducing the overall rendered html size to just structure and content and using the LLM to extract content from that. It works quite well. But I think your approach might be more efficient and faster.
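
One way to read "just structure and content" is: drop scripts, styles and all attributes, and keep only tag names and visible text. The sketch below does that with the standard library. It is not the commenter's actual pipeline; the `SKIP` set and the `reduce_html` name are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags whose contents carry no extractable content.
SKIP = {"script", "style", "noscript", "svg"}

class Skeleton(HTMLParser):
    """Re-emits the page as bare tags plus whitespace-collapsed text."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif not self._skip_depth:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.out.append(text)

def reduce_html(html):
    parser = Skeleton()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

page = '<div class="a" id="x"><script>var a=1;</script><p style="color:red">Hello   <b>world</b></p></div>'
print(reduce_html(page))  # <div><p>Hello<b>world</b></p></div>
```

Note that this version also drops whitespace between adjacent inline elements, which shrinks the input further but can glue words together; whether that matters depends on what the LLM is asked to extract.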

nodoodles 773 days ago

One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
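
A rough sketch of that diffing idea, assuming line-level comparison and a simple notion of "leniency" (case, digits and whitespace are ignored when matching). The function names and the `min_share` parameter are hypothetical; the comment above does not say how the actual diff is done.

```python
import re

def normalize(line: str) -> str:
    # Leniency: replace digit runs and collapse whitespace, so that
    # "Page 1 of 9" and "Page 2 of 9" count as the same boilerplate line.
    return re.sub(r"\s+", " ", re.sub(r"\d+", "#", line)).strip().lower()

def strip_common(page: str, siblings: list[str], min_share: float = 1.0) -> str:
    """Drop lines of `page` that also occur in at least `min_share` of the sibling pages."""
    sibling_sets = [{normalize(l) for l in s.splitlines()} for s in siblings]
    kept = []
    for line in page.splitlines():
        key = normalize(line)
        if not key:
            continue  # skip blank lines
        hits = sum(key in s for s in sibling_sets)
        if sibling_sets and hits / len(sibling_sets) >= min_share:
            continue  # shared with enough siblings: treat as header/footer
        kept.append(line)
    return "\n".join(kept)

page_a = "<header>Site nav</header>\n<p>Article A text</p>\n<footer>Page 1 of 9</footer>"
page_b = "<header>Site nav</header>\n<p>Article B text</p>\n<footer>Page 2 of 9</footer>"
print(strip_common(page_a, [page_b]))  # <p>Article A text</p>
```

Lowering `min_share` below 1.0 tolerates siblings that lack a given header or footer, at the risk of removing content that merely happens to repeat across a few pages.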

cpobuda 774 days ago

I have been working on this. Feel free to DM me.

wraptile 774 days ago

Parsing HTML is a solved and frankly not very interesting problem. Writing XPath/CSS selectors or JSON parsers (for when the data is in script variables) is not much of a challenge for anyone.
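
For the "data in script variables" case mentioned above, a small sketch: find the assignment and let the JSON decoder consume exactly one value from that position. The variable name `window.__STATE__` and the helper name are made up for the example; real sites use their own names, and some embed JavaScript object literals that are not valid JSON, which this would not handle.

```python
import json
import re

def extract_script_json(html: str, var_name: str):
    """Return the JSON value assigned to `var_name` in the page source, or None."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (e.g. the trailing ";</script>").
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value

page = '<script>window.__STATE__ = {"sku": "A1", "price": 9.5};</script>'
print(extract_script_json(page, "window.__STATE__"))  # {'sku': 'A1', 'price': 9.5}
```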

A more interesting problem is parsing data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control a web browser that is indistinguishable from a real user's and perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models would fumble the harder tasks, and the easier tasks can be done by a human being in a couple of hours.

LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not the scraping itself but scaling scrapers and dealing with blocking, which is becoming extremely common.

_el1s7 770 days ago

Exactly, are you aware of any current efforts of people trying to do that?

wraptile 769 days ago

Not anything in open source yet.