| HN Mirror


by lucb1e 406 days ago

There is no automation, I use `tail -f access.log`

I just look at what's happening on my server every now and then. Sometimes not for months, but when I set up a project like that caching proxy, I keep a more regular eye on it, as I'm doing now, to see that crawlers aren't bothering the upstream via me. Most respect the robots policy, and most of the ones that don't still set a user agent string that includes the word 'bot', so I know not to refresh the cache based on that request. So far it has mostly been Huawei who pretend to be a regular user but request millions of pages (from 12 separate IP ranges so far, some of them bigger than a /16, some of them a handful of /24s).
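To illustrate the user agent check described above, here is a minimal Python sketch. The function name `should_refresh_cache` is hypothetical and is not taken from the actual proxy; it only shows the idea of not refreshing the cache for requests that identify themselves as bots.

```python
def should_refresh_cache(user_agent: str) -> bool:
    """Return False when the User-Agent self-identifies as a bot.

    Hypothetical helper: a case-insensitive substring match on 'bot',
    as a stand-in for whatever check the real proxy performs.
    """
    return "bot" not in (user_agent or "").lower()


if __name__ == "__main__":
    print(should_refresh_cache("Mozilla/5.0 (compatible; Googlebot/2.1)"))
    print(should_refresh_cache("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

A plain substring match is crude, and it does nothing against crawlers that pretend to be regular users, which is the case the rest of this comment is about.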

> Could you say more about how to identify these bad-bots?

Many requests per day to random pages from either the same IP address (range), or ranges owned by the same corporation
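A rough Python sketch of that per-range counting, for anyone who wants more than `tail -f`. It assumes a log format where the client IP is the first whitespace-separated field (as in the common and combined log formats); the function names and the /24 and /48 grouping are illustrative choices, not something from the setup described above.

```python
import ipaddress
from collections import Counter


def network_of(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Map an IP address to the enclosing network of the given prefix length."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False lets us pass a host address and get its network back
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


def requests_per_network(log_lines, v4_prefix: int = 24) -> Counter:
    """Count requests per network, skipping lines without a parsable IP."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            counts[network_of(fields[0], v4_prefix)] += 1
        except ValueError:
            continue  # first field was not an IP address
    return counts


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 12',
        '203.0.113.77 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 12',
        '198.51.100.1 - - [01/Jan/2024:00:00:03 +0000] "GET /c HTTP/1.1" 200 12',
    ]
    for network, hits in requests_per_network(sample).most_common():
        print(network, hits)
```

Grouping ranges by owning corporation needs an extra WHOIS or ASN lookup per range, which this sketch leaves out.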


reconnecting 406 days ago

Interesting. Our open-source platform [1] has the capacity to help with all of this through a GUI and rule engine, but I'm still concerned about whether we should present this way of bot hunting as a feature. I worry that this approach may be irrelevant in today's context.

[1] https://github.com/TirrenoTechnologies/tirreno


tough 406 days ago

I mean, if anything, with AI, data and main sources are becoming the actual precious resource again.

So I'd expect an uptick in bots as everyone races to try and compete with Google on data hoarding.


reconnecting 406 days ago

As far as I can see, there is already a heavy wave of new AI/startup/VC etc. data companies whose data consumption goes beyond what websites expected in the pre-AI era.

However, I see the development of new bot types that tackle security in more aggressive ways. It's not just simple SQL injection as it was before, but more sophisticated and custom bots that not only request but also push a lot.

Or just a couple of days ago, I found a new type of bot that "brute-forces" website folder structure. ~205,000 requests in a couple of days.

These new bots are probably not directly the work of AI, but they seem to be a consequence of it.
