HN Mirror

	by jordif 3379 days ago
	Good article! I've been scraping for the last 10 years and I've seen lots of different things done to try to stop us. I'm also on the other side, protecting websites by banning scrapers, which is funny!

skinnymuch 3379 days ago

I'm in the same position for the first time (protecting against scraping) and honestly I'm kind of blind right now. That's weird considering how much scraping I've done (okay, not that much). Any tips, tricks, or blogs you know of off the top of your head for protecting your site?

corford 3379 days ago

Virtually everything can be easily defeated. The only outfit I've consistently seen put up a good fight is Distil. They do it by acting a little like Cloudflare: they put their servers in front of your web-facing endpoints and use ML to mine their global client traffic for bot signals (aided by some aggressive in-browser JavaScript fingerprinting).
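
To make the idea of "bot signals" concrete, here is a minimal sketch of scoring a request by a few heuristics. It is not Distil's method, which this thread does not describe in detail and which reportedly relies on ML over traffic from many sites. The `RequestInfo` fields, the weights, and the threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestInfo:
    """Hypothetical summary of one client's recent behaviour."""
    user_agent: str
    requests_last_minute: int
    has_js_fingerprint: bool  # did the client run the in-browser script?


# Illustrative substrings only; real scrapers can spoof any user agent.
SUSPICIOUS_UA_TOKENS = ("curl", "python-requests", "headless")


def bot_score(req: RequestInfo) -> int:
    """Add up made-up weights for each signal that looks automated."""
    score = 0
    ua = req.user_agent.lower()
    if not ua or any(token in ua for token in SUSPICIOUS_UA_TOKENS):
        score += 2
    if req.requests_last_minute > 60:
        score += 2
    if not req.has_js_fingerprint:
        score += 1
    return score


def is_probable_bot(req: RequestInfo, threshold: int = 3) -> bool:
    """Flag the client once the combined score reaches the threshold."""
    return bot_score(req) >= threshold


browser = RequestInfo("Mozilla/5.0 (X11; Linux x86_64)", 5, True)
scraper = RequestInfo("python-requests/2.31", 120, False)
print(is_probable_bot(browser), is_probable_bot(scraper))
```

Static rules like these are exactly the kind of wall that is easy to get around, which is why a service that keeps learning from traffic across many sites is harder to beat.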

kbenson 3376 days ago

Yeah, Distil is the first outfit I've encountered that has a model making it really hard to bypass reliably. It comes down to: "I can spend a significant amount of time trying to bypass this, and I would, but they would likely identify and block me again within a few weeks at most." That's not worth it when bypassing them is only part of what I need to do to scrape some data, while it's their entire job and they can afford to hire multiple people.

The economics are in their favor, and I make it a point not to fight economics when I recognize them; it's rarely sustainable.

skinnymuch 3370 days ago

Distil is really interesting.

skinnymuch 3370 days ago

Interesting, thanks.

jordif 3378 days ago

Over the years, I've come to the conclusion that everything can be scraped. What you have to do is put up as many walls as you can. But if someone really wants to crawl your site, with the right knowledge they will be able to do it despite all your walls.

skinnymuch 3370 days ago

Yes, that's what I've assumed as well.
