by extremeMath 2115 days ago
I failed a web scraping project due to strong anti-bot detection.

They checked for bots through user agent/screen size, maybe mouse movements, trends in searches (same area code), etc. (Can they really detect me through my internet connection headers, despite proxies?)
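
To make the user agent/screen size idea concrete, here is a toy sketch of a server-side consistency check. Everything in it (the function name, the marker list, the 1200px threshold) is made up for illustration; real anti-bot systems are not public and are certainly more involved than this.

```python
# Toy sketch only: the marker list and the threshold are assumptions,
# not taken from any real anti-bot product.
AUTOMATION_MARKERS = ("headlesschrome", "phantomjs", "python-requests", "curl")


def suspicious_signals(user_agent: str, screen_w: int, screen_h: int) -> list[str]:
    """Return the reasons a client looks automated (empty list if none)."""
    ua = user_agent.lower()
    reasons = []
    # Some tools announce themselves in the user agent string.
    if any(marker in ua for marker in AUTOMATION_MARKERS):
        reasons.append("automation marker in user agent")
    # A browser that reports no screen at all is unusual.
    if screen_w <= 0 or screen_h <= 0:
        reasons.append("missing or zero screen size")
    # A phone user agent paired with a desktop-sized screen is inconsistent.
    is_mobile_ua = any(word in ua for word in ("mobile", "iphone", "android"))
    if is_mobile_ua and min(screen_w, screen_h) > 1200:
        reasons.append("mobile user agent with desktop-sized screen")
    return reasons
```

The point is that the individual values don't have to look fake; it's enough that they disagree with each other.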

It was impossible for me to scrape; they won.

2 comments

same here.

there are 2 approaches they use that make developing bots very difficult.

1. they detect device input. if there is no mouse movement while the website is being loaded, they will consider it a bot.

2. they detect the order of page visits. A human visitor will not enumerate all paths; instead, they follow certain patterns. This is detectable with their machine learning model.

I really don't have a solution for #2
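
For a feel of why #2 is easy to spot, here is a toy heuristic, not their actual model (which I haven't seen): it measures how often a session steps a trailing numeric ID by exactly one, which is what naive enumeration looks like. The function name and the URL shapes are hypothetical.

```python
import re

# Splits a path into (prefix, trailing number), e.g. "/item/12" -> ("/item/", "12").
_ID_RE = re.compile(r"^(.*?)(\d+)/?$")


def enumeration_score(paths: list[str]) -> float:
    """Fraction of consecutive visits that bump a trailing numeric ID by +1
    under the same prefix. 1.0 means pure enumeration, 0.0 means none."""
    if len(paths) < 2:
        return 0.0
    steps = 0
    for prev, cur in zip(paths, paths[1:]):
        a, b = _ID_RE.match(prev), _ID_RE.match(cur)
        if (
            a and b
            and a.group(1) == b.group(1)
            and int(b.group(2)) == int(a.group(2)) + 1
        ):
            steps += 1
    return steps / (len(paths) - 1)
```

A crawler walking `/item/1`, `/item/2`, `/item/3` scores 1.0, while a session that goes home page, search, a couple of unrelated items, cart scores 0.0. A real model presumably looks at many more features (timing, referrers, depth), but even this crude signal separates the two.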

I think the solution is "hybrid" scraping with a human driving the clicks and the scraper passively collecting the data.
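
One low-tech way to try this, as a sketch: browse the site by hand with the browser dev tools open, export the network log as a HAR file, and pull the JSON responses out afterwards. The function name below is hypothetical; the field names follow the HAR format (`log.entries[].response.content`).

```python
import base64
import json


def json_responses_from_har(har: dict) -> list[tuple[str, object]]:
    """Return (url, parsed_body) for every JSON response recorded in a HAR log."""
    results = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        # Only keep responses the server labelled as JSON.
        if "json" not in content.get("mimeType", ""):
            continue
        text = content.get("text")
        if not text:
            continue
        # HAR may store bodies base64-encoded.
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            results.append((entry["request"]["url"], json.loads(text)))
        except (KeyError, json.JSONDecodeError):
            continue  # skip malformed or truncated bodies
    return results
```

Usage would be something like `json_responses_from_har(json.load(open("session.har", encoding="utf-8")))`, where `session.har` is whatever you exported. Since a real person generated all the traffic, there is nothing for the bot detection to catch; the cost is that throughput is limited to how fast someone can click.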

If you record, you can probably teach AI to emulate.

I love this. I might try it. It doesn't scale, but that's okay for my project.

Which website are you talking about here? Even the hardest-to-scrape sites can be scraped with proxies and headless browsers.

Theoretically.

Think about websites that have every reason to stop you from scraping.

It's not reasonable to dissect their huge obfuscated JS code. So headless doesn't really work.