Hacker News
by luigi23 651 days ago
Why are scrapers so popular nowadays?
5 comments

There’s a lot of data that we should have programmatic access to that we don’t.

The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason.

Any website that has my data and doesn’t give me access to it is a great target for scraping.

I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.
Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.
Was it profitable?
I'm sure it was profitable in keeping him busy during the pandemic. Not everything has to generate monetary value; you can do something for the experience, for fun, to kick the tyres, or for open-source and/or philanthropic reasons.

Besides, it's a low-margin, heavily capitalized, and heavily crowded market you'd be entering, and not worth the negative monetary return in the short and medium term (unless you wrote AI in the title and we're going to the mooooooon, baby).

It was in the sense that I learned that trying to beat the market is fundamentally impossible/stupid, so just invest in index funds.
Because publishers don’t push structured data or APIs enough to satisfy demand for the data.
Got it, but why is it booming now, and why is it so often a showcase for an LLM? Is there some secret market/use case for it?
Building scrapers sucks.

It's generally not hard because it's conceptually very difficult or because it requires extremely high-level reasoning.

It sucks because when someone changes "<section class='bio'>" to "<div class='section bio'>" your scraper breaks. I just want the bio and it's obvious what to grab, but machines have no nuance.
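To make that breakage concrete, here's a minimal sketch using only Python's standard library. The `extract` helper, the two matcher functions, and the sample bio text are made up for illustration; this is not anyone's actual scraper. A matcher keyed to the exact tag and class stops working after the markup change, while one keyed only to the `bio` class token happens to survive this particular change (it would not survive a class rename).

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TextByMatch(HTMLParser):
    """Collect the text inside the first element accepted by matches(tag, classes)."""

    def __init__(self, matches):
        super().__init__()
        self.matches = matches
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # set once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.matches(tag, classes):
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, matches):
    """Return the matched element's text, or None if nothing matched."""
    parser = TextByMatch(matches)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() or None


def brittle(tag, classes):
    # Hard-codes both the tag name and the exact class attribute.
    return tag == "section" and classes == ["bio"]


def tolerant(tag, classes):
    # Only requires that "bio" is one of the class tokens.
    return "bio" in classes


old_page = "<section class='bio'>Example bio text.</section>"
new_page = "<div class='section bio'>Example bio text.</div>"

print(extract(old_page, brittle))   # matches the old markup
print(extract(new_page, brittle))   # no match after the change
print(extract(new_page, tolerant))  # still finds the bio
```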

LLMs have enough common sense to be able to deal with these things and they take almost no time to work with. I can throw html at something, with a vague description and pull out structured data with no engineer required, and it'll probably work when the page changes.

There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution + a bit of cleanup is hugely beneficial.
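The "throw HTML at it with a vague description" workflow could look roughly like the sketch below. Everything here is an assumption: `complete` stands in for whatever LLM client you use (a hypothetical callable taking a prompt string and returning the model's text), and the prompt wording and field names are invented. The only real logic is building the prompt and defensively pulling a JSON object out of the reply.

```python
import json


def build_prompt(html, fields):
    """Ask for the named fields as a single JSON object."""
    return (
        "Extract the following fields from the HTML below and reply with "
        "one JSON object and nothing else. Use null for anything missing.\n"
        f"Fields: {', '.join(fields)}\n\nHTML:\n{html}"
    )


def extract_fields(html, fields, complete):
    """Return {field: value} using `complete`, a hypothetical prompt -> text callable."""
    reply = complete(build_prompt(html, fields))
    # Models often wrap JSON in prose or code fences; keep the outermost {...}.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    data = json.loads(reply[start:end + 1])
    # Return exactly the requested keys so callers get a stable shape.
    return {name: data.get(name) for name in fields}


def fake_complete(prompt):
    # Stand-in for a real model call, used only to exercise the parsing path.
    return 'Sure, here it is:\n{"bio": "Example bio text.", "extra": 1}'


page = "<div class='section bio'>Example bio text.</div>"
print(extract_fields(page, ["bio", "name"], fake_complete))
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider, and makes the "bit of cleanup" step easy to test without network access.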

Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.
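As a rough illustration of what "loose" can mean here, the sketch below anchors on a text label and skips whatever non-digit characters sit between it and the number. The pattern, the `open_interest` helper, and the sample snippets are hypothetical and are not taken from RNSAFFN.com or from the actual report format.

```python
import re

# Hypothetical pattern: a label, then any run of non-digit characters
# (tags, whitespace, punctuation), then a number with optional commas.
OPEN_INTEREST = re.compile(r"Open\s+Interest\D*?(\d[\d,]*)", re.IGNORECASE)


def open_interest(page_text):
    """Return the first number following the label, or None if absent."""
    match = OPEN_INTEREST.search(page_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# The same pattern handles two differently structured snippets.
print(open_interest("<td>Open Interest</td><td>1,234</td>"))
print(open_interest("<p>open   interest: <b>1,234</b></p>"))
```

The looseness cuts both ways: because the pattern ignores structure, a digit appearing in markup between the label and the value (say, in a class name) would be captured instead of the real number.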
My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.

My preferred approach for scraping these days is Playwright Python and CSS selectors to select things from the DOM. Still prone to breakage, but reasonably pleasant to debug using browser DevTools.

I don't know if many people have the same use case, but... I'm heavily relying on this right now because my daughter started school. The school board, the school, and the teacher each use a different app to communicate important information to parents. I'm just trying to make one feed with all of them. Before AI it would have been hell to scrape, because you can imagine those apps are terrible.

Fun aside: The worst one of them is a public Facebook page. The school board is making it their official communication channel, which I find horrible. Facebook is making it so hard to scrape. And if you don't know, you can't even use Facebook's API for this anymore, unless you have a business verified account and go through a review just for this permission.

Scrapers have always been notoriously brittle and prone to breaking completely when pages make even the smallest of structural changes.

Scraping with LLMs bypasses that pitfall because it's more of a summarization task on the whole document, rather than working specifically on a hard-coded document structure to extract specific data.

Personally, I find it's better for archiving, as most sites don't provide a convenient way to save their content directly. Occasionally, I do it just to make a better interface over the data.
There's been a large push to do server-side rendering for web pages, which means that companies no longer have a publicly facing API to fetch the data they display on their websites.

Parsing the rendered HTML is the only way to extract the data you need.

I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...
We've been doing something similar for VLM Run [1]. A lot of websites that have obfuscated HTML / JS or rendered charts / tables tend to be hard to parse with the DOM. Taking screenshots is definitely more reliable and future-proof, as these webpages are built for humans to interact with.

That said, the costs can be high as the OP says, but we're building cheaper and more specialized models for web screenshot -> JSON parsing.

Also, it turns out you can do a lot more than just web-scraping [2].

[1] https://vlm.run

[2] https://docs.vlm.run/introduction

What do you think all this LLM stuff will evolve into? Of course, it's moving on from chitchat about stale information and into an "automate the web" kind of phase, like it or not.