HN Mirror


by extremeMath 2115 days ago

I failed a web scraping project due to strong anti-bot detection.

They checked for bots through user agent and screen size, maybe mouse movements, trends in searches (same area code), etc. (Can they really detect me through my internet connection headers, despite proxies?)

It was impossible for me to scrape. They won.

2 comments

billconan 2115 days ago

Same here.

There are 2 approaches they use that make developing bots very difficult.

1. They detect device input. If there is no mouse movement while the website is loading, they treat the visitor as a bot.

2. They detect the order of page visits. A human visitor will not enumerate all paths; instead, they follow certain patterns. This is detectable with their machine learning model.

I really don't have a solution for #2
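
To make #2 concrete, here is a toy sketch of how visit-order detection might work. This is not any site's real model (those are presumably far more elaborate); the function name `sequential_ratio` and the `/item/<id>` URL shape are made up for illustration. It only measures how often consecutive ID-bearing page views step through IDs one by one, which is what a naive enumerating bot does.

```python
import re


def sequential_ratio(paths):
    """Fraction of consecutive ID-bearing requests whose trailing numeric ID
    increases by exactly 1. Paths without a trailing numeric ID are skipped.

    Hypothetical heuristic for illustration only.
    """
    ids = []
    for path in paths:
        match = re.search(r"/(\d+)/?$", path)
        if match:
            ids.append(int(match.group(1)))
    if len(ids) < 2:
        return 0.0
    steps = sum(1 for a, b in zip(ids, ids[1:]) if b - a == 1)
    return steps / (len(ids) - 1)


# A bot enumerating every path versus a made-up human-looking session.
crawler = [f"/item/{i}" for i in range(100, 110)]
human = ["/item/104", "/search", "/item/871", "/item/872", "/item/12"]

print(sequential_ratio(crawler))
print(sequential_ratio(human))
```

The enumerating session scores at the top of the range and the browsing session much lower, so even a crude threshold separates them. Shuffling the order would beat this particular check, but probably not a model that also looks at timing and referrers.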

forgotmypw17 2115 days ago

I think the solution is "hybrid" scraping with a human driving the clicks and the scraper passively collecting the data.

If you record, you can probably teach AI to emulate.
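
A minimal sketch of the "passive collection" half, assuming the human browses with the browser's dev tools open and then exports the network log as a HAR file. The helper names `extract_responses` and `load_har`, the `/api/items` filter, and the file name are all hypothetical; the field names (`log.entries`, `request.url`, `response.content.text`, `encoding`) come from the HAR format itself.

```python
import base64
import json


def extract_responses(har, url_substring, mime_prefix="application/json"):
    """Return (url, body_text) pairs from a parsed HAR dict.

    Keeps only entries whose URL contains url_substring and whose response
    MIME type starts with mime_prefix. Entries with no recorded body are skipped.
    """
    results = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        content = entry.get("response", {}).get("content", {})
        if url_substring not in url:
            continue
        if not content.get("mimeType", "").startswith(mime_prefix):
            continue
        text = content.get("text")
        if text is None:
            continue
        # HAR marks base64-encoded bodies with an "encoding" field.
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        results.append((url, text))
    return results


def load_har(path):
    """Read a HAR file (plain JSON) exported from the browser."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Usage would look something like `extract_responses(load_har("session.har"), "/api/items")`. Since a real person generated every request, there is nothing bot-like for the site to detect; the cost is that throughput is limited to how fast someone can click.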

extremeMath 2115 days ago

I love this. I might try it. It doesn't scale, but that's okay for my project.

r_singh 2115 days ago

Which website are you talking about here? Even the hardest-to-scrape sites can be scraped with proxies and headless browsers.

extremeMath 2115 days ago

Theoretically.

Think about websites that have every reason to stop you from scraping.

It's not reasonable to dissect their huge obfuscated JS code. So headless doesn't really work.
