by AbortedLaunch
298 days ago
Some of these crawlers appear to be designed to avoid rate limiting based on IP. I regularly see millions of unique IPs making strange requests, each sending just one or at most a few per day. When a response contains a unique redirect, I often see a geographically distinct address fetching the destination.
How would a UA string help?
For example, a crawler making "strange" requests can send _any_ UA string, and a crawler doing "normal" requests can also send _any_ UA string.
The "doing requests" is what I refer to as "behaviour"
A website operator might think "Crawlers making strange requests send UA string X but not Y"
Let's assume the "strange" requests cause a "website load" problem^1
Then a crawler, or any www user, makes a "normal" request and sends UA string X; the operator blocks or redirects the request, unnecessarily
Then a crawler makes a "strange" request and sends UA string Y; the operator allows the request and the website "blows up"
What matters for the "blowing up websites" problem^1 is behaviour, not UA string
1. The article's title calls it the "blowing up websites" problem, but the article text calls it a problem with "website load". As always, the details are missing. For example, what is the "load" at issue? Is it TCP connections or HTTP requests? What number of simultaneous connections and/or requests per second is acceptable, and what number is not? Again, behaviour is the issue, not UA string
The acceptable numbers need to be published; for example, see documentation for "web APIs"
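To make the behaviour-versus-UA-string point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `ua_filter` and `BehaviourLimiter`, the limit of 5 requests per 1-second window, and the use of a client key. It is not taken from the article or from any particular server. The limiter counts requests and ignores the UA string entirely.

```python
from collections import defaultdict, deque

# Hypothetical "published" limit: 5 requests per 1-second window.
MAX_REQUESTS = 5
WINDOW_SECONDS = 1.0


def ua_filter(user_agent, blocked=("X",)):
    """UA-based decision: looks only at a string the client chooses freely."""
    return user_agent not in blocked


class BehaviourLimiter:
    """Sliding-window limiter: decides on request rate only."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.history = defaultdict(deque)  # client key -> recent timestamps

    def allow(self, client, now, user_agent=None):
        # user_agent is accepted but deliberately never inspected.
        stamps = self.history[client]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_requests:
            return False
        stamps.append(now)
        return True


if __name__ == "__main__":
    limiter = BehaviourLimiter()
    # One "normal" request with UA X: the UA filter blocks it, the limiter allows it.
    print(ua_filter("X"), limiter.allow("normal-client", 0.0, "X"))
    # Ten "strange" requests in 0.1 s with UA Y: the UA filter allows all of them,
    # the limiter allows the first five and rejects the rest.
    burst = [limiter.allow("burst-client", i * 0.01, "Y") for i in range(10)]
    print(ua_filter("Y"), burst)
```

The choice of client key is itself an assumption. As noted upthread, a crawler spread across millions of IPs sends only a few requests per address, so a per-IP key would not catch it; the sketch only shows that the decision is driven by request counts and not by the UA string.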