by LinuxBender 1007 days ago
Would an easier way to scrape 100s of websites be useful to you?

Not to me, but I would be curious whether you found a way to mimic a real human browsing a site aside from Chrome Headless. Do your TCP packets and HTTPS requests look indistinguishable from real people's? The reason I ask is that it's a fun hobby for me to see if I can block a scraper without any proprietary tools.

No, we haven't. We're building on existing scraping tools (e.g. Selenium) and building the reasoning engine that will take actions on the page via these tools.

I'm unfamiliar with blocking mechanisms. Could you share some things you would do to block existing Selenium scraping jobs?

The answers differ for each scraper operator, and I don't have a generalized answer specific to Selenium, so it depends on what unique identifiers one can find and where they host their scrapers. Some use JavaScript to try to detect it [1], but I just have silly hobby sites these days. I personally like to look at TCP/IP headers and anything else unique the scraper is doing, so I can intercept things sooner. Some proxies are easy to spot by changes to MSS and TTL. Some bots add, or fail to add, browser headers I would expect to see. Some bot owners don't even change the user-agent; that is trivial to spoof, but most just don't bother. I doubt you would be lazy like that if you are offering a scraping service, so I am betting your scraper would be harder to detect and more fun to tinker with.
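To make the header, TTL, and MSS ideas above a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the list of "expected" headers, and the thresholds are hypothetical, and it presumes you have already captured the observed TTL and MSS some other way (for example, from a packet capture). The TTL check relies on the common defaults of 128 for Windows and 64 for Linux and macOS; the MSS check assumes IPv4 over plain Ethernet. These are weak signals at best, not proof of a bot.

```python
# Hypothetical heuristics: names, header list, and thresholds are illustrative assumptions.
COMMON_INITIAL_TTLS = (64, 128, 255)

# Headers a mainstream desktop browser normally sends on a page load (assumed list).
EXPECTED_BROWSER_HEADERS = ("user-agent", "accept", "accept-language", "accept-encoding")


def guess_initial_ttl(observed_ttl: int) -> int:
    """Round an observed TTL up to the nearest common initial value."""
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial
    raise ValueError(f"TTL out of range: {observed_ttl}")


def suspicion_reasons(headers: dict[str, str], observed_ttl: int, mss: int) -> list[str]:
    """Return human-readable reasons a request looks unlike a normal browser."""
    reasons = []
    lowered = {name.lower(): value for name, value in headers.items()}

    # Signal 1: browser headers that are absent.
    missing = [h for h in EXPECTED_BROWSER_HEADERS if h not in lowered]
    if missing:
        reasons.append("missing headers: " + ", ".join(missing))

    # Signal 2: the User-Agent claims Windows (default TTL 128),
    # but the packet looks like it started at a different TTL.
    user_agent = lowered.get("user-agent", "")
    initial_ttl = guess_initial_ttl(observed_ttl)
    if "Windows" in user_agent and initial_ttl != 128:
        reasons.append(f"User-Agent claims Windows but initial TTL looks like {initial_ttl}")

    # Signal 3: 1460 is the usual IPv4 MSS on a 1500-byte MTU link;
    # tunnels and some proxies advertise less (so do some ordinary ISPs).
    if mss < 1460:
        reasons.append(f"MSS {mss} is below 1460, possible tunnel or proxy")

    return reasons


if __name__ == "__main__":
    # Made-up request: a script that kept its default user-agent.
    sample = {
        "User-Agent": "python-requests/2.31.0",
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
    }
    for reason in suspicion_reasons(sample, observed_ttl=52, mss=1380):
        print(reason)
```

In practice I would only treat a request as suspicious when several of these reasons line up, since any single one has plenty of innocent explanations.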

So I guess, to answer your question, I would have to see some example packets. You could send some requests to the awful little blog I have in my profile, if you are willing to share.

I was tinkering with custom figlet ASCII text at one point. Automation can be made to solve it, but unless I were hosting a popular site nobody would bother, and I could just rotate through half of the figlet fonts and modulate the spacing, direction, and letter overlap to make it fun. For now, however, I try to avoid anything that requires client interaction, fully accepting that I am not doing anything as advanced as what CDNs like Cloudflare do in their sleep. It's just a hobby for me.

[1] - https://www.zenrows.com/blog/selenium-avoid-bot-detection