by paco3346 1694 days ago
I'm right there with you. I'm the lead engineer for an automotive SaaS provider (~6,000 customers, ~4 billion requests per month), and we recently started moving all our services to Cloudflare's WAF to take advantage of their bot protection. We were seeing 100,000+ scrapes per minute from botnets, which was affecting performance.

We chose to switch to the JS challenge screen because it requires no human interaction. By our best estimate we now block about 75% of bot traffic, but some customers are livid over the challenge screen.


I'm really surprised that the JS challenges helped so much, given that there are open source libraries for bypassing them (e.g. cloudscraper[0]).

[0]: https://github.com/venomous/cloudscraper

If someone wanted to get past it they probably could. We've had a few sources of traffic that we've had to straight up block (as opposed to challenge) because of this exact issue. So far it's been a "good enough" solution that blocks enough of the bot traffic to be effective.
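To make the challenge-versus-block distinction concrete, here is a minimal sketch of how such a tiered policy could look. It is not how Cloudflare's WAF works internally and not our actual configuration; the `SourcePolicy` class, the 60-second window, and the threshold value are all hypothetical. The idea is that a noisy source gets challenged first, and a source that has already passed the challenge but keeps sending bot-level traffic gets blocked outright.

```python
from collections import defaultdict, deque


class SourcePolicy:
    """Toy per-source policy: allow, challenge, or block (all values hypothetical)."""

    def __init__(self, challenge_limit=600, window=60.0):
        self.challenge_limit = challenge_limit  # max requests per window before acting
        self.window = window                    # sliding window length in seconds
        self.hits = defaultdict(deque)          # source -> timestamps of recent requests
        self.passed = set()                     # sources that solved the challenge

    def mark_challenge_passed(self, source):
        self.passed.add(source)

    def decide(self, source, now):
        """Record one request from `source` at time `now` and return the action."""
        recent = self.hits[source]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) <= self.challenge_limit:
            return "allow"
        # Over the limit: a source that already got past the challenge is
        # treated as a bypassing bot and blocked instead of challenged again.
        return "block" if source in self.passed else "challenge"


policy = SourcePolicy(challenge_limit=3)
actions = [policy.decide("203.0.113.7", now=t) for t in range(4)]
print(actions)
```

A real deployment would key on more than the source address (ASN, user agent, fingerprint), which is part of why "good enough" is the realistic goal here.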
What were they scraping, if I can ask? Was it targeted or just wget -r style?
It was a hybrid of low-effort vulnerability scanning and targeted inventory scraping. Many dealerships in the automotive space will pay gray-hat third parties to scrape and compile data on their competitors.

The irony for us as a provider is that one of our customers (party A) is paying a third party to scrape data from another one of our customers (party B), which in turn affects the performance of party A's site. We've started blocking these third parties and directing them to the paid APIs we offer.

And how do you get your 'inventory data'? Aren't you scraping (or using scraped data) yourself? Oh the irony :)
No, we're a contracted provider for these customers. They ingest their data into our network through APIs or CSVs.
Makes little sense - customers upload data to you and they don't want any data back? Really?
It's not them who want it back, it's their competitors who want it.
Why do you think those bots were scraping your data in the first place?