by matheusmoreira 1410 days ago
This was my approach too and it's been working great. Nowadays data often isn't rendered directly into HTML anymore; it gets downloaded from some JSON API endpoint. So I use network monitoring tools to see where it's coming from and then interface with the endpoint directly. I've essentially written custom clients for someone else's site. One of my scrapers is actually just curl piped into jq. Sometimes they change the API and I have to adapt, but that's fine.
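
For illustration, here is a minimal Python sketch of the same idea as curl piped into jq: request the JSON endpoint directly, then pick out the fields you want. The URL, the `items` key, the `title` field, and the function names are all assumptions made up for the example; the real values would come from whatever the browser's network tab shows for the site in question.

```python
import json
import urllib.request


def fetch_json(url, user_agent="example-scraper/0.1"):
    """Download and decode a JSON document from the given URL."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Return the "title" of each entry under the hypothetical "items" key.

    Entries without a title are skipped rather than raising an error.
    """
    return [item["title"] for item in payload.get("items", []) if "title" in item]


if __name__ == "__main__":
    # Hypothetical endpoint: substitute the one found with network monitoring.
    data = fetch_json("https://example.com/api/items")
    print("\n".join(extract_titles(data)))
```

Keeping the parsing step separate from the download means that when the API changes, usually only `extract_titles` needs adapting.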

> I understand companies can put roadblocks to hinder this

Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

> Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

Cloudflare Bot Protection[1] is a popular one. The website is guarded by a layer of code that needs to be executed before continuing. Normal browsers will follow through. It can be hard to bypass.

[1]: https://www.cloudflare.com/pg-lp/bot-mitigation-fight-mode/
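
As a rough sketch of how a scraper might notice it has hit such a challenge instead of real data, the check below looks at the response status and headers. It is only a heuristic under stated assumptions: it relies on challenge responses carrying a `cf-mitigated: challenge` header and falls back to a 403/503 status combined with `Server: cloudflare`. The function name is made up for the example, and nothing here attempts to bypass the protection.

```python
def looks_like_cloudflare_challenge(status, headers):
    """Heuristically guess whether a response is a Cloudflare challenge page.

    `status` is the HTTP status code and `headers` is a mapping of header
    names to values. Header names are compared case-insensitively.
    """
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Assumption: challenge responses are marked with this header.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Fallback guess: a blocked-looking status served by Cloudflare.
    # This can also match ordinary 403/503 errors, so treat it as a hint.
    return status in (403, 503) and normalized.get("server") == "cloudflare"
```

A scraper could call this on each response and back off or stop when it returns `True`, rather than trying to parse the challenge page as JSON.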

I have a codebase that defeats Cloudflare protection. It felt like I had the keys to the kingdom.
So that would break text browsers too, right? :(

And users with JS disabled for privacy reasons.