| HN Mirror

by matheusmoreira 1410 days ago

This was my approach too and it's been working great. Nowadays data often isn't rendered directly into the HTML anymore; it gets downloaded from some JSON API endpoint. So I use network monitoring tools to see where it's coming from and then interface with the endpoint directly. I've essentially written custom clients for someone else's site. One of my scrapers is actually just curl piped into jq. Sometimes they change the API and I have to adapt, but that's fine.
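A rough sketch of that kind of client, in Python rather than curl and jq. Everything specific here is an assumption for illustration: the URL, the `items` key, and the `title` field are hypothetical stand-ins for whatever shows up in the browser's network tab.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: substitute the URL observed in the network tab.
API_URL = "https://example.com/api/items?page=1"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Download and decode a JSON document from the endpoint."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload: dict) -> list[str]:
    """Pull one field out of each record, roughly like jq -r '.items[].title'.

    The "items" and "title" names are assumed; a real API will differ.
    """
    return [item["title"] for item in payload.get("items", [])]


# Offline demonstration with a made-up payload, so no request is sent.
sample = json.loads('{"items": [{"title": "first"}, {"title": "second"}]}')
print(extract_titles(sample))
```

Against a live site it would be `extract_titles(fetch_json(API_URL))`. Keeping the download and the field extraction separate means only the extraction has to change when the API's response shape changes.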

> I understand companies can put roadblocks to hinder this

Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

1 comments

shubhamjain 1410 days ago

> Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

Cloudflare Bot Protection[1] is a popular one. The website is guarded by a layer of code that the client has to execute before it can continue to the actual page. Normal browsers run it and pass through, but it can be hard for a scraper to bypass.

[1]: https://www.cloudflare.com/pg-lp/bot-mitigation-fight-mode/


jamescampbell 1409 days ago

I have a codebase that defeats Cloudflare's protection. Felt like I had the keys to the kingdom.


GekkePrutser 1410 days ago

So that would break text browsers too, right? :(

And users with JS disabled for privacy reasons.
