Hacker News
Ask HN: Would an easier way to scrape 100s of websites be useful to you?
7 points by asim-shrestha 1006 days ago
In the process of building AI agents, we've found that what we built could eventually be good at dynamically scraping data across a variety of websites (10s to 100s of different sites at a time).

Our understanding is that existing web scraping tools are bad at this because you need to write a custom scraping configuration per site. Not only that, but when a site changes its styling, it might completely break your automation. With agents, however, you can provide a high-level natural language overview of the data you'd like from a website or class of websites, and the agent system will deal with the details of traversing a page and fetching data automatically.

We’re curious how useful this might be for people. If you’ve experienced issues that this might solve or have already explored the space, I'd love to hear from you!

4 comments

Scraping hundreds of sites seems mostly unethical.

My sites are periodically on the wrong end of scrapers, greedy by design or in error, which occasionally need to be manually blocked or even threatened with legal action.

Just because something can be done doesn't mean it should be. It also doesn't mean that you should make it easier, any more than you should offer 'better' SPAM engines...

Appreciate the input, Damon. Ethical concerns are definitely a consideration, and we'll want to respect mechanisms like robots.txt.
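As a rough illustration of what respecting robots.txt could look like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `allowed_by_robots`, the agent name `example-agent`, and the sample rules are all hypothetical, not anything the posters described; a real crawler would fetch each site's actual robots.txt rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for path in ("/private/page", "/public/page"):
        url = "https://example.com" + path
        verdict = allowed_by_robots(EXAMPLE_ROBOTS, "example-agent", url)
        print(url, "allowed" if verdict else "disallowed")
```

Checking before every fetch is cheap, and it gives site owners a way to opt out without resorting to manual blocking.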
Yawn, ignore that input. Scrape away!
What kind of websites? You mean like social media sites that are obfuscated to prevent scraping? I suppose it would have to be quite reliable.

I don't know how relevant this is, but I was thinking that you could probably use some sort of AI to enhance OCR and convert written documents into some sort of semantic form like HTML or LaTeX. That would allow you to scrape information from books, and written books still have a lot of untapped knowledge.

It seems like the demand for web scraping and such is to create datasets for ML training. And now you are using AI for scraping. So it is sort of a self-improving cycle.

Not specifically social media sites. Getting past their scraping prevention would be difficult, and there are already a lot of existing companies working on scraping popular social media sites.

Interesting idea. We're definitely looking into coupling OCR and LLMs today, but not for that particular case. I think raw language models with a good workflow are typically good enough to extract structured data from things like books.

ML training is definitely one area where we can see this being useful. General data aggregation across a large industry (clothing, retail, etc.) is something we want to look into, as are RPA-style workflows involving multi-click actions across a variety of sites.

Would an easier way to scrape 100s of websites be useful to you?

Not to me, but I would be curious if you found a way to mimic a real human browsing a site aside from Chrome Headless. Do your TCP packets and HTTPS requests look indistinguishable from real people's? The reason I ask is that it's a fun hobby for me to see if I can block a scraper without any proprietary tools.

No, we haven't. We're building off existing scraping tools (e.g. Selenium) and building the reasoning engine that will take actions on the page via these tools.

We're unfamiliar with blocking mechanisms. Could you share some things you would do to block existing Selenium scraping jobs?

The answers are different for each scraper operator and I don't have a generalized answer specific to Selenium, so it depends on what unique identifiers one can find and where they host their scrapers. Some use JavaScript to try to detect it [1], but I just have silly hobby sites these days. I personally like to look at TCP/IP headers and anything else unique the scraper is doing, to intercept things sooner. Some proxies are easy to spot by changes to MSS and TTL. Some bots add, or fail to add, browser headers I would expect to see. Some bot owners don't even change the user-agent; it is trivial to spoof, but most just don't bother. I doubt you would be lazy like that if you are offering a scraping service, so I am betting your scraper would be harder to detect and more fun to tinker with.
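The header and user-agent checks mentioned above might be sketched roughly as follows. This is an illustrative guess, not the commenter's actual setup: the function name `suspicion_reasons` and both lists are assumptions chosen for the example, and the TCP-level signals (MSS, TTL) are out of scope because they are not visible at the HTTP layer.

```python
# Headers a mainstream browser normally sends (assumed list, for illustration).
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")

# Substrings that appear in the default user-agent of some common tools
# (assumed list; trivially spoofed, as the comment above points out).
AUTOMATION_UA_MARKERS = ("python-requests", "curl", "headlesschrome")


def suspicion_reasons(headers: dict[str, str]) -> list[str]:
    """Return human-readable reasons a request looks automated (empty if none)."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    reasons = []
    for name in EXPECTED_BROWSER_HEADERS:
        if name not in normalized:
            reasons.append(f"missing header: {name}")
    user_agent = normalized.get("user-agent", "").lower()
    if not user_agent:
        reasons.append("missing header: user-agent")
    for marker in AUTOMATION_UA_MARKERS:
        if marker in user_agent:
            reasons.append(f"user-agent contains: {marker}")
    return reasons


if __name__ == "__main__":
    print(suspicion_reasons({"User-Agent": "python-requests/2.31.0"}))
```

A check like this only catches operators who leave the defaults in place; anyone who copies a real browser's headers passes it.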

So I guess to answer your question I would have to see some example packets. You could send some requests to the awful little blog I have in my profile if you were willing to share.

I was tinkering with custom figlet ASCII text at one point, which automation can be made to solve, but unless I were hosting a popular site nobody would bother, and I could just rotate through half of the figlet fonts and modulate the spacing, direction and letter overlap to make it fun. For now, however, I try to avoid anything that requires client interaction, while fully accepting that I am not doing anything as advanced as what CDNs like Cloudflare do in their sleep. It's just a hobby for me.

[1] - https://www.zenrows.com/blog/selenium-avoid-bot-detection

My take is that the "idea people" usually underestimate how easy it is to make user interfaces and overestimate the difficulty of scraping, API "integrations", etc. (It's what makes Zapier such a successful racket.)
Our take is that scraping a website in isolation is typically quite easy for a somewhat technical person.

Scraping at larger scale is where it becomes challenging. The problems we want to tackle are:

1. You need at least a bit of technical expertise to do things like configure selectors properly.
2. Websites typically have moderation in place to block scrapers.
3. Scrapers are prone to breaking when the site layout changes.
4. Creating on the order of 100s of scrapers is difficult and time consuming, and having this many amplifies the previous issue.
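To make the selector and layout-change problems concrete, here is a small sketch of a class-based extractor built on Python's standard-library `html.parser`. Everything in it is hypothetical (the `ClassTextExtractor` name, the `price` class, the sample markup); it is only meant to show how a per-site selector silently returns nothing once a site renames a class.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given CSS class.

    Simplification: assumes well-formed markup where every start tag inside
    a match has an end tag (an unclosed <br> would throw off the depth count).
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []   # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def extract(html, css_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.matches]


# Hypothetical page before and after a styling change.
OLD_PAGE = '<div><span class="price">$10</span></div>'
NEW_PAGE = '<div><span class="price-v2">$10</span></div>'

if __name__ == "__main__":
    print(extract(OLD_PAGE, "price"))  # the selector matches
    print(extract(NEW_PAGE, "price"))  # the same selector now finds nothing
```

Multiply this by hundreds of sites, each with its own selectors, and the maintenance burden described in points 3 and 4 follows directly.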