by 1vuio0pswjnm7 297 days ago
"Some of these crawlers appear to be designed to avoid rate limiting based on IP."

Unless the rate is exceeded, the limit is not being avoided

"I regularly see millions of unique ips doing strange requests, each just one or at most a few per day."

Assuming the rate limit is more than one or a few requests every 24h this would be complying with the limit, not avoiding it

It could be that the problem some website operators are concerned about is not "website load", i.e., the problem the article is discussing, but something else (NB. I am not speculating about this particular operator; I am making a general observation)

If a website is able to fulfill all requests from unique IPs without affecting quality of service, then it stands to reason "website load" is not a problem the website operator is having

For example, the article's title claims Meta is among the "worst offenders" in creating excessive website load through "AI crawlers, fetchers"

Meta has been shown to have used third party proxy services with rotating IP addresses in order to scrape other websites; it also sued one of these services because it was being used to scrape Meta's website, Facebook

https://brightdata.com/blog/general/meta-dismisses-claim-aga...

Whether the problem that Meta was having with this "scraping" was "website load" is debatable; if the requests were being fulfilled without affecting QoS, then arguably "website load" was not a problem

Rate-limiting addresses the problem of website load; it allows website operators to ensure that requests from all IP addresses are adequately served as opposed to preferentially servicing some IP addresses to the detriment of others (degraded QoS)
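
To make the distinction concrete, here is a minimal sketch of a per-IP fixed-window limiter. Everything in it is an assumption for illustration: the class name, the limit of 10 requests per IP per day, and the client labels are hypothetical, and real servers often use other schemes (token bucket, sliding window). It only shows that a single IP exceeding the limit is throttled, while many unique IPs each sending one request all stay within it

```python
from collections import defaultdict


class PerIPRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        # Maps ip -> [index of the current window, requests counted in it]
        self.counters = defaultdict(lambda: [None, 0])

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is served."""
        window_index = int(now // self.window)
        state = self.counters[ip]
        if state[0] != window_index:
            # A new window has started for this IP, so reset its count
            state[0] = window_index
            state[1] = 0
        if state[1] >= self.limit:
            return False
        state[1] += 1
        return True


# Hypothetical policy: 10 requests per IP per 24 hours
limiter = PerIPRateLimiter(limit=10, window=86400)

# One IP sending 1000 requests within the same day: only the first 10 are served
single_ip_served = sum(limiter.allow("203.0.113.1", now=t) for t in range(1000))

# 1000 unique clients sending one request each: every request complies with the limit
many_ips_served = sum(limiter.allow(f"client-{i}", now=i) for i in range(1000))

print(single_ip_served, many_ips_served)  # prints: 10 1000
```

In both cases the server is asked for 1000 requests. The limiter caps what any one IP can take; it says nothing about who controls the IPs, which is a separate question from website load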

Perhaps some website operators become concerned that many unique IP addresses may be under the control of a single entity, and that this entity may be a competitor; this could be a problem for them

But if their website is able to fulfill all the requests it receives without degrading QoS then arguably "website load" is not a problem they are having

NB. I am not suggesting that a high volume of requests from a single entity, each complying with a rate-limit, is acceptable, nor am I making any comment about the practice of "scraping" for commercial gain. I am only commenting on what rate-limiting is designed to do and whether it works for that purpose