Hacker News
by mnmkng 1395 days ago
Hey! Crawlee uses the libraries from our fingerprint suite internally. https://github.com/apify/fingerprint-suite#performance

It has an A rating in BotD (FingerprintJS) detection. Now we're working on improving CreepJS detection. That one is really tough, though. I'm not even sure anybody would use it in production environments, as it must throw a lot of false positives.

It will always be free and maintained, because we're using it internally in all of our projects. We thought about adding a commercial license like Docker's: open source, but paid if you have more than $10 million in revenue or more than 250 employees. But in the end we decided not to do even that, so it's free and always will be.


Hi! Very cool project. Just out of curiosity, what trips up Crawlee on CreepJS? I haven't heard of anyone actually using it in production (actually don't think it's meant for production use). It's certainly overzealous in its aggregate "trust score", but (a) it seems like a good benchmark to aim for; (b) some of its sub-scores, like "stealth" and "like headless", might be helpful for Crawlee to evaluate, given the signals included in those analyses are fairly simple for people to throw together in their own custom (production) bot detection scripts and are somewhat ubiquitous.
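To illustrate how simple such home-grown checks can be, here is a minimal sketch. It is not CreepJS's actual signal list; the `NavigatorLike` shape, the `pluginCount` field, and the `headlessHints` helper are hypothetical names made up for this example, and the hints are just a few commonly cited ones.

```typescript
// Hypothetical, flattened view of a few navigator fields.
// Not CreepJS's real signal set; illustrative only.
interface NavigatorLike {
  userAgent: string;
  webdriver?: boolean;
  languages: readonly string[];
  pluginCount: number; // stands in for navigator.plugins.length
}

// Returns the names of the simple headless/automation hints that fired.
function headlessHints(nav: NavigatorLike): string[] {
  const hints: string[] = [];
  if (nav.webdriver === true) hints.push("webdriver flag set");
  if (nav.userAgent.includes("HeadlessChrome")) hints.push("HeadlessChrome in user agent");
  if (nav.languages.length === 0) hints.push("empty languages list");
  if (nav.pluginCount === 0) hints.push("no plugins");
  return hints;
}
```

A real script would read these values from the browser's `navigator` object and would likely weigh the hints rather than treat any single one as proof of a bot.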
With fingerprints, it's a tradeoff between having enough of them for large-scale scraping and staying consistent with your environment. E.g. you can get many more combinations if you also use Firefox, WebKit, macOS and Windows user agents (and prints) when you're running Chrome on Linux, but you also expose yourself to the better detection algorithms. If you stick to Linux Chrome prints only (which is what you usually run in VMs), you'll be less detectable, but might get rate limited.
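A rough way to picture that tradeoff, as a toy sketch only: the `Profile` type and the `buildPool` and `consistentWith` helpers below are hypothetical and are not part of Crawlee or the fingerprint suite. Real fingerprints carry far more attributes than a browser and an OS, so the actual gap is much larger than in this example.

```typescript
type Browser = "chrome" | "firefox" | "webkit";
type OS = "linux" | "macos" | "windows";

interface Profile {
  browser: Browser;
  os: OS;
}

// Builds every browser/OS pairing (hypothetical helper, illustrative only).
function buildPool(browsers: Browser[], systems: OS[]): Profile[] {
  const pool: Profile[] = [];
  for (const browser of browsers) {
    for (const os of systems) {
      pool.push({ browser, os });
    }
  }
  return pool;
}

// Keeps only the profiles that match what the machine really runs.
function consistentWith(pool: Profile[], real: Profile): Profile[] {
  return pool.filter((p) => p.browser === real.browser && p.os === real.os);
}

const real: Profile = { browser: "chrome", os: "linux" };
const widePool = buildPool(["chrome", "firefox", "webkit"], ["linux", "macos", "windows"]);
const narrowPool = consistentWith(widePool, real);

console.log(widePool.length, narrowPool.length); // 9 1
```

The wide pool gives more identities to rotate through, but most of them contradict the real environment; the narrow pool stays consistent at the cost of variety.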