by LinuxBender 688 days ago
There are myriad bot lists on GitHub and assorted websites, each maintained to a different standard. Given the public nature of these lists it is unlikely anyone is going to pay for them. As others mentioned, these are ephemeral. Bots are operated by opportunists and they will hop from network to network to find addresses not yet tainted. There are also a large number of bots that use mobile LTE networks.

IP addresses are not entirely useful for determining if something is a bot. Rather, its TCP/IP characteristics, application characteristics and behavior will be a greater indicator. As simplistic and non-exhaustive examples, SSH bots often use really old SSH libraries that can be spotted in the handshake. Some HTTP bots likewise use really old libraries that cannot speak newer protocols, with obvious exceptions such as those driving headless Chrome. Some bots use really old TCP/IP libraries that do not set certain IP header fields and options, or that fixate on specific options or source ports. Some bots also hide behind proxies, Tor and CDNs. They also hide their DNS lookups behind all the public DNS resolvers. The vast majority of bot authors are lazy and looking for quick wins.
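A minimal sketch of the SSH example above, assuming the only signal inspected is the client identification string that opens the handshake (RFC 4253, section 4.2). The `MIN_VERSIONS` table and the function names are hypothetical and purely illustrative, not a vetted policy. The string is client-controlled and easy to spoof, so this would be one weak signal among many.

```python
import re

# RFC 4253 section 4.2: "SSH-protoversion-softwareversion SP comments"
BANNER_RE = re.compile(
    r"^SSH-(?P<proto>[^-\s]+)-(?P<software>\S+)(?: (?P<comments>.*))?$"
)

# Hypothetical cut-offs chosen for illustration only; tune against real traffic.
MIN_VERSIONS = {
    "libssh": (0, 9),
    "libssh2": (1, 9),
    "paramiko": (2, 0),
}


def parse_banner(line):
    """Return (protoversion, softwareversion), or None if the line is malformed."""
    match = BANNER_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group("proto"), match.group("software")


def looks_outdated(line):
    """Flag malformed banners, pre-2.0 protocols and old library versions."""
    parsed = parse_banner(line)
    if parsed is None:
        return True  # not an SSH identification string at all
    proto, software = parsed
    if proto not in ("2.0", "1.99"):
        return True  # SSH-1 era client
    match = re.match(r"^([A-Za-z0-9]+)[_-](\d+)\.(\d+)", software)
    if match is None:
        return False  # no recognisable name/version pair, so no opinion
    name = match.group(1).lower()
    version = (int(match.group(2)), int(match.group(3)))
    minimum = MIN_VERSIONS.get(name)
    return minimum is not None and version < minimum
```

For example, `looks_outdated("SSH-2.0-libssh-0.6.3")` is flagged under these assumed cut-offs, while `looks_outdated("SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.1")` is not, because OpenSSH is not in the table.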

I suppose that is a long-winded way of saying I doubt there would be much interest in paying for IP lists, as the greater value lies in understanding traffic behavior and packet characteristics. Perhaps if you started writing advanced eBPF code that could mostly accurately separate traffic into bot vs non-bot, then you would have created a piece of Cloudflare and people might pay to self-host that. Anyone going this route should create public challenges to validate the accuracy of the classification. Independent third parties must participate in the challenge to validate both the ability to "spot the bots" and that legitimate people are not blocked. That would be valuable. To garner interest there should be a free version.
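As a rough sketch of how such a public challenge could be scored, assuming the third party supplies ground-truth labels for each connection and the classifier returns a flag for each: the `Score` type and `score_challenge` function are hypothetical names. The point is that both numbers get reported, since a detector that catches every bot but also blocks real users fails the second requirement.

```python
from dataclasses import dataclass


@dataclass
class Score:
    detection_rate: float       # share of real bots that were flagged
    false_positive_rate: float  # share of legitimate clients that were flagged


def score_challenge(labels, verdicts):
    """Compare ground truth (True = bot) with classifier verdicts (True = flagged)."""
    if len(labels) != len(verdicts):
        raise ValueError("labels and verdicts must have the same length")
    bots = sum(1 for is_bot in labels if is_bot)
    humans = len(labels) - bots
    caught = sum(1 for is_bot, flagged in zip(labels, verdicts) if is_bot and flagged)
    blocked = sum(1 for is_bot, flagged in zip(labels, verdicts) if not is_bot and flagged)
    return Score(
        detection_rate=caught / bots if bots else 0.0,
        false_positive_rate=blocked / humans if humans else 0.0,
    )


if __name__ == "__main__":
    # Made-up data: 4 bots and 5 legitimate clients.
    truth = [True, True, True, True, False, False, False, False, False]
    flags = [True, True, True, False, False, False, False, False, True]
    print(score_challenge(truth, flags))
```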

I can speak from experience that some companies are not permitted to send specific types of traffic over a third party such as a CDN, and that would be the use case for self-hosted bot mitigation. Some companies try to accomplish this using bot mitigation in hardware load balancers, multi-million dollar firewalls and DDoS appliances. These devices are expensive and do not scale well, not to mention that they only stop very specific types of bots and attacks. These devices are also sometimes the cause of outages. In my experience, the more expensive a device is and the more promises made around said device, the more glorious the outage it will cause.