by randomstring 1491 days ago
Search traffic has always been mostly automated spam bots.

Even back in the Open Directory days, when we powered part of search.netscape.com, I estimated 80+% of all search traffic was automated. At least most of it self-identified with the same Java user agent.

Later, when working at Topix, most traffic was bot traffic even though it was a news search engine. Most queries included the word “mortgage”. Topix specialized in localized content, which was very popular with SEO scrapers.

Lastly at Blekko, I estimate 90+% of traffic was automated. By then maybe half or more had learned to change the user agent. Most used HTTP/1.0, a dead giveaway, as no browser still uses 1.0. This was a major part of Blekko's load shedding strategy. If the servers started to get overloaded, we'd start bouncing suspected bot traffic to a redirect that would show up in the logs. If there was a human with a modern browser running JavaScript on the other end, they would get redirected to a link that wouldn't get bounced. I would check the logs weekly to see if any humans got caught. None ever did. This was a huge monetary savings: you only need 1/10th the servers if you can safely ignore the bots.
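
For anyone who wants to try something similar, here is a rough sketch of that kind of load shedding decision. This is not Blekko's actual code. The `Request` class, the `js_ok` cookie name, and the exact heuristics are all made up for illustration; the only signals taken from the comment above are the HTTP/1.0 giveaway, the self-identifying Java user agent, and the idea that a browser running JavaScript can prove itself and avoid the bounce.

```python
from dataclasses import dataclass, field

# Hypothetical cookie that a small JavaScript snippet on the bounce page
# would set before sending the browser back. The name is an assumption.
JS_PROOF_COOKIE = "js_ok"


@dataclass
class Request:
    """Minimal stand-in for whatever request object your server exposes."""
    http_version: str                      # e.g. "HTTP/1.0" or "HTTP/1.1"
    user_agent: str
    cookies: dict = field(default_factory=dict)


def looks_like_bot(req: Request) -> bool:
    """Cheap heuristics: HTTP/1.0, or an empty or library-style user agent."""
    if req.http_version == "HTTP/1.0":
        return True
    ua = req.user_agent.strip().lower()
    return ua == "" or ua.startswith("java")


def handle(req: Request, overloaded: bool) -> str:
    """Return 'serve' or 'bounce' for a single request."""
    if not overloaded:
        return "serve"                     # only shed load under pressure
    if req.cookies.get(JS_PROOF_COOKIE) == "1":
        return "serve"                     # client already ran the JavaScript
    if looks_like_bot(req):
        return "bounce"                    # log this so humans can be spotted
    return "serve"
```

The important part is logging every bounce, so you can check afterwards whether any real humans were caught.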

Often it's endless repetition of the same keywords in a random order with a place name appended, prepended, or inserted, over and over. Often it's variations on known monetizable SEO keywords. However, much of it doesn't make any sense.
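
One way to spot that pattern in a query log, sketched under some assumptions: drop anything that looks like a place name, sort the remaining words, and count how often the same bag of keywords comes back. The function names and the `place_names` set are hypothetical, and this only handles single-word place names; it is an illustration of the idea, not how any of the search engines above did it.

```python
from collections import Counter


def signature(query: str, place_names: frozenset) -> tuple:
    """Order-independent keyword signature with place names removed."""
    words = query.lower().split()
    return tuple(sorted(w for w in words if w not in place_names))


def repeated_signatures(queries, place_names, min_count=3):
    """Return the signatures seen at least min_count times, with their counts."""
    counts = Counter(signature(q, place_names) for q in queries)
    return {sig: n for sig, n in counts.items() if n >= min_count}


# Example with made-up queries and a made-up place list.
places = frozenset({"austin", "boston", "denver"})
log = [
    "cheap mortgage rates austin",
    "rates cheap mortgage boston",
    "denver mortgage cheap rates",
    "weather today",
]
suspicious = repeated_signatures(log, places)
```

With the sample log above, the three mortgage queries collapse into one signature, while the weather query stays below the threshold.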

I don't have any insight into Google's numbers, but I would conservatively estimate 95% or more of all their queries come from automated bots and not humans. And the level of spy-vs-spy going on for Google CPU resources vs SEO bots is probably pretty evolved by now. I stopped tracking many years ago when Google switched to densely packed obfuscated JavaScript for page renders. Maybe this is part of why automated queries are so high across the web: maybe Google is too hard to crack for most.

2 comments

Appreciate the sharing of info here.

I have recently been discovering and combating some similar, albeit much smaller issues.

I've been finding that a bunch of my recent 'resource sucks' have been constant spidering from petal-bot, semrush bot, alibiba-bot and a few others.

Using the WordPress plugin stop-bad-bots and its logs has been eye-opening for me recently.

I understand many of these are not directly dark-SEO related, but their aggressive nature is hurting the CPU and memory limits of some of my servers and sites, so it's a big issue regardless of the intent behind them.

(Kind of) glad someone else has dealt with these issues, and glad to see some of the 'how' for identifying and handling them, plus some actual numbers for the impact. I've been guessing at some of these things in my small projects, and indeed it's a real thing, as well as a practical issue to pay attention to and work on.

Almost sounds like it is justified to add a javascript crypto miner to your pages to make the bots pay for the use of your service.
The point is that the vast majority of scrapers do not bother to run javascript.