|
|
|
|
|
by rozap
499 days ago
|
|
What is a "junk" request? Is it hammering an expensive endpoint 5000 times per second, or just somebody using your website in a way you don't like? I've also been on both sides of it (on-call at 3am getting dos'd is no fun), but I think the danger here is that we've gotten to a point where a new google can't realistically be created. The thing is that these tools are generally used to further entrench power that monopolies, duopolies, and cartels already have. Example: I've built an app that compares grocery prices as you make a shopping list, and you would not believe the extent that grocers go to to make price comparison difficult. This thing doesn't make thousands or even hundreds of requests - maybe a few dozen over the course of a day. What I thought would be a quick little project has turned out to be wildly adversarial. But now spite driven development is a factor so I will press on. It will always be a cat and mouse game, but we're at a point where the cat has a 46 billion dollar market cap and handles a huge portion of traffic on the internet. |
|
They ignored robots.txt (claimed not to, but I blacklisted them there and they didn't stop) and started randomly generating image paths. At some point /img/123.png became /img/123.png?a=123 or whatever, and they just kept adding parameters and subpaths for no good reason. Nginx dutifully ignored the extra parameters and kept sending the same images files over and over again, wasting everyone's time and bandwidth.
I was able to block these bots by just blocking the entire IP range at the firewall level (for Huawei I had to block all of China Telecom and later a huge range owned by Tencent for similar reasons).
I have lost all faith in scrapers. I've written my own scrapers too, but almost all of the scrapers I've come across are nefarious. Some scour the internet searching for personal data to sell, some look for websites to send hack attempts at to brute force bug bounty programs, others are just scraping for more AI content. Until the scraping industry starts behaving, I can't feel bad for people blocking these things even if they hurt small search engines.