by lucb1e
406 days ago
There is no automation. I use `tail -f access.log` and just look at what's happening on my server every now and then. Sometimes I don't look for months, but when I set up a project like that caching proxy, I keep a more regular eye on it to check that crawlers aren't bothering the upstream via me. Most respect the robots policy, and most of the ones that don't still set a user agent string that includes the word 'bot', so I know not to refresh the cache based on that request. So far it has mostly been Huawei pretending to be a regular user while requesting millions of pages (from 12 separate IP ranges so far, some of them bigger than a /16, some of them a handful of /24s).

> Could you say more about how to identify these bad-bots?

Many requests per day to random pages from either the same IP address (or range), or from ranges owned by the same corporation.
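To be clear, I do this by eye, but if someone wanted to script the same heuristic, a rough sketch could look like the following. It assumes the common "combined" access log format, groups IPv4 clients by /24 and IPv6 clients by /48, and uses an arbitrary threshold; the names `LOG_LINE`, `network_of` and `suspicious_ranges` are made up for this example. It does not look up which corporation owns a range; that would need a separate whois or ASN lookup.

```python
import ipaddress
import re
from collections import Counter

# Assumption: "combined" log format, i.e.
# IP ident user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def network_of(ip, v4_prefix=24, v6_prefix=48):
    """Return the network an address falls in (prefix sizes are arbitrary choices)."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)


def suspicious_ranges(lines, threshold=1000):
    """Count requests per network, ignoring self-declared bots.

    Returns (network, request_count) pairs at or above the threshold,
    busiest first.
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in the assumed format
        if "bot" in match.group("agent").lower():
            continue  # announces itself as a bot; handled separately
        try:
            net = network_of(match.group("ip"))
        except ValueError:
            continue  # first field is a hostname, not an IP address
        hits[net] += 1
    return [(net, n) for net, n in hits.most_common() if n >= threshold]
```

Feeding it one day's log, e.g. `suspicious_ranges(open("access.log"), threshold=1000)`, gives a list of ranges worth looking at by hand. The threshold depends entirely on what normal traffic looks like on your server.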