Hacker News
by taeric 6 days ago
I confess a sad assumption that bot traffic is far higher than we have admitted for a long time. Though, maybe we would see different stats specifically for social media sites, where bots astroturf like counts? It certainly feels like we have known for a long time that bots were a larger share of ad viewing than ad companies wanted to admit.

I don't understand what difference bots make. For me, a website (the public part) is a storefront. People walk down the street and see what's inside — that's the purpose. If something should not be available immediately, that's the private part of the store.

I've been monitoring bot traffic on digital platforms for over 10 years. Sure, the crawler share is growing, some even with malicious intentions, and those I detect and block.

I disagree that this pain is worth the cost of making real people spend their life on verification.

For ad views, the concern is specifically that people pay for clicks and views. That these can be so heavily influenced by bot traffic greatly undermines their value.

The same general idea goes for any of the algorithmically driven platforms. The algorithms are ostensibly intended to surface organically discovered things by watching how people interact with them. That they are so susceptible to distortion through bot farms should be acknowledged a lot more than it is. People trust them far more than they should.

There is also a general cost of running things concern. It isn't like it is completely free to execute on bot traffic.

For ads, I believe this must be a problem for ad platform owners.

If the digital platform's storefront is their business, they could afford to spend some budget on bot detection. Bots still come from data center networks, sometimes render pages incompletely, request resources in bulk, and show enough patterns to be flagged internally.

If we look at a medium-sized website, most random crawlers will come from Amazon, Microsoft, DigitalOcean, Hetzner, OVH, and a few other DC networks, and these can be blocked easily without harming real users. The rest can be detected and cleaned up, even manually.
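
A minimal sketch of that kind of check, assuming you keep your own list of data center CIDR ranges. The two ranges below are RFC 5737 documentation placeholders, not real provider ranges, and `is_datacenter_ip` is a name made up for this example:

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real provider ranges.
# A real list would have to be built from each provider's published data.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_datacenter_ip(addr: str) -> bool:
    """Return True if addr falls inside any of the listed networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_NETWORKS)

print(is_datacenter_ip("192.0.2.17"))   # True
print(is_datacenter_ip("203.0.113.5"))  # False
```

This only covers the "comes from a DC network" signal. Incomplete rendering and bulk resource requests would need separate checks.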

The math is simple: 20,000 visits a day at 15 seconds each = ~83 hours a day lost watching a Cloudflare logo, just because someone doesn't want to dig into the logs. I don't buy it.
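
Spelled out, using only the two numbers above (the variable names are just for illustration):

```python
visits_per_day = 20_000
seconds_per_check = 15  # time each visitor spends on the verification page

# Total seconds spent per day, converted to hours.
hours_lost = visits_per_day * seconds_per_check / 3600
print(round(hours_lost, 1))  # 83.3
```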

Largely agreed, though I think you are likely underestimating how hard this is to detect. In particular, it is true that many bots can be hosted in data centers, but it is somewhat trivial to launder that traffic through other sources. Malware, in particular, is what I have in mind. Maybe I'm wrong and that has largely gone away?

There is also a bit of mixed incentives. Yes, it is the ad platform that is getting abused. But it is also the ad platform that is charging people based on abused practices.

And it isn't like this is completely made up. Just look at how Facebook killed a ton of people during the "pivot to video" programs. I don't know all of the details, as I was thankfully not in any of the involved industries, but my understanding is that it is fairly well documented.

Edit: I changed an "isn't" to "is." I think I was trying to reword at one point, but left it in a way that is opposite what I meant.

When most of your server capacity is going to answering the scrapers, it matters. It's not that the stuff is hidden, it's that the storefront is being flooded with 10x as many customers as the fire code allows. And some of them go around asking your employees mindless questions. (Small forum I help moderate: we were getting hammered with what was probably some sort of AI that was taking search queries and feeding them into the forum search. Search is now registered users only.)

> When most of your server capacity is going to answering the scrapers it matters

I've been dealing with the web since the previous century and still haven't managed to build a website that could be hurt by scrapers visiting it.

If you went through the logs, you'd probably see that these bots are on a single IP or subnet, which can be easily detected and blocked instead of closing off search to non-registered users.
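
A rough sketch of that log pass, assuming the client IPs have already been extracted from the access log. The sample addresses are documentation placeholders and `requests_per_subnet` is a hypothetical helper. It only surfaces traffic that is concentrated in a few subnets; requests spread across many unrelated addresses would not stand out this way.

```python
import ipaddress
from collections import Counter

def requests_per_subnet(ips, prefix=24):
    """Count requests per IPv4 subnet of the given prefix length."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(net)] += 1
    return counts

sample = ["192.0.2.1", "192.0.2.9", "192.0.2.200", "198.51.100.7"]
print(requests_per_subnet(sample).most_common(1))  # [('192.0.2.0/24', 3)]
```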

That's incorrect, they use residential proxy networks.
Botnet.

Our offending searches were coming from many addresses.

For efficiently-hosted sites with little media it's not too bad. E.g. hosting a static site just doesn't cost much, even if you're hammered occasionally.

That's extremely far from all sites though. It's probably safe to say it's a severe minority, particularly when you ignore personal / non-profit-bringing sites. Tons of small and large sites run stuff like poorly written WordPress or Ruby on Rails or thousands of microservices doing god knows what. A major increase in request volume on those can easily mean significant increases in hosting charges (e.g. small-% on big, many multiples on small) or significant effort in optimizing (which is expensive too).

The website I mentioned has over 15k webpages and ~200 GB of media, and yet we monitor bots manually and only block them if they're pulling 5k requests in a row. Malicious URLs and multiple 404s are blocked by default. HEAD requests are rejected.
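
Roughly, those rules could be sketched like this. The 5k threshold is the one mentioned above; the 404 limit, `ClientStats`, and `should_reject` are assumptions made up for the example, and the malicious-URL check is left out because it depends on the site:

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_REQUESTS = 5_000  # "5k requests in a row"
MAX_404S = 10                     # assumption: no exact number is given above

@dataclass
class ClientStats:
    consecutive_requests: int = 0  # requests in the current uninterrupted run
    not_found_count: int = 0       # 404 responses served to this client

def should_reject(stats: ClientStats, method: str) -> bool:
    """Apply the three rules: HEAD, repeated 404s, long request runs."""
    if method.upper() == "HEAD":
        return True
    if stats.not_found_count >= MAX_404S:
        return True
    return stats.consecutive_requests >= MAX_CONSECUTIVE_REQUESTS

print(should_reject(ClientStats(consecutive_requests=120), "GET"))    # False
print(should_reject(ClientStats(consecutive_requests=5_000), "GET"))  # True
```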

Even on a very bad day, the server's page load time doesn't go over 1s.

However, it seems like I'm indeed looking at the problem through the wrong prism, as what I've seen from the comments suggests that the initial issue is performance, and the bots are what uncover it.

I think a good chunk of it is bot-induced performance problems, yea. Whether that's compute or transfer. And advertisement costs.

Optimization is very very much not a solved problem though, just look at basically all software ever written - it's written for an optimization priority and to a price point (whether commercial $$ or via personal time), and that target's value to its users has shifted rather dramatically.

This is really interesting. I indeed looked at this problem from the wrong perspective.

I'm working on an open-source tool that could be useful for bot detection, but I'm still not confident that anyone would deploy it on-prem and take on the setup/maintenance instead of just routing traffic through the cloud.

Perhaps performance as a KPI could work. Thanks!

I think you'd definitely find some interest, e.g. anyone that intentionally avoids "the cloud" will want something local. Honestly I assume there are some of these already, monitoring apache/nginx/etc logs. Anubis is arguably similar and has been exploding lately, for example, though I'm not sure if it auto-updates its rules at all: https://github.com/TecharoHQ/anubis

As to if it'd get enough interest: yea no idea at all. I wish you luck tho! Clearly there's a need for this kind of thing.

Well, the fun thing is that no one knows how much traffic of what kind they are getting when they use Cloudflare.

You get the numbers that Cloudflare tells you, but who knows if you can trust their stats after their CEO is apparently cherry-picking data to shape their product narrative?

That's the same CEO who just went on a wild, tone-deaf layoff justification, classifying human employees into roles of either a builder, seller, or measurer and saying he wants to get rid of everyone that "measures" the business...

I wouldn't trust a single thing coming out of his mouth.