Hacker News
by jimrandomh 16 days ago
I deal with scrapers that sometimes border on DDoSes for LessWrong. The amount of bot traffic varies greatly between sites; if you have more URLs you get more bot traffic (regardless of whether those URLs represent a deep content catalog, or useless URL parameter permutations). It's bad for LW because of the content-catalog depth.

It's easy to drastically underestimate the amount of bot traffic, because bots make efforts (of varying sophistication) to look human enough to evade blocking. That includes using fake user-agent strings corresponding to real browsers (often but not always with implausibly old version numbers), proxying through residential IPs, and sometimes using full headless browsers. In my own data, traffic from badly behaved browser-impersonation bots exceeds traffic from named scrapers like GPTBot by something like 10x.
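As a rough illustration of the "implausibly old version numbers" signal, here is a minimal sketch that flags user-agent strings claiming an old Chrome major version. The cutoff `MIN_PLAUSIBLE_CHROME` and the function name `looks_stale` are assumptions made up for this example, not anything LessWrong is known to run, and a real filter would need a cutoff that moves with browser releases.

```python
import re

# Hypothetical cutoff: treat Chrome major versions below this as suspicious.
# The right value depends on when you run this; 100 is only a placeholder.
MIN_PLAUSIBLE_CHROME = 100

CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def looks_stale(user_agent: str, min_major: int = MIN_PLAUSIBLE_CHROME) -> bool:
    """Return True if the UA claims a Chrome major version below min_major."""
    match = CHROME_RE.search(user_agent)
    if match is None:
        # Not claiming to be Chrome, so this particular check says nothing.
        return False
    return int(match.group(1)) < min_major
```

On its own this is a weak signal: it only catches bots that bother to fake a browser UA and pick an old one, so it would be one input among several rather than a blocking rule.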

The measured percentage of bot traffic is higher for HTML than for other content types because many bots will load an HTML page, and then not load the JS/CSS/image/etc resources it references. But these are the least-sophisticated and most-detectable bots.
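The "loads HTML but none of the referenced resources" pattern can be sketched roughly as below. The input shape (pairs of a client identifier and a response content type) and the name `html_only_clients` are assumptions for illustration; how you identify a client (IP, IP plus UA, session cookie) is left open.

```python
from collections import defaultdict


def html_only_clients(requests):
    """Find clients that fetched HTML and nothing else.

    requests: iterable of (client_id, content_type) pairs.
    Returns the set of client ids whose every request was for text/html.
    """
    kinds_seen = defaultdict(set)
    for client_id, content_type in requests:
        kind = "html" if content_type.startswith("text/html") else "other"
        kinds_seen[client_id].add(kind)
    return {client for client, kinds in kinds_seen.items() if kinds == {"html"}}
```

Real browsers with a warm cache can also skip subresources on repeat visits, so this probably works better over a longer window than per page view.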

4 comments

Meta comes through with a /24 worth of scrapers and ignores robots.txt. I'm inclined to poison my data with fake information about Zuckerberg.
Did you check the IP addresses? Are they all from AS32934?
Yes

57.141.0.42 - - [05/Jun/2026:19:50:19 +0000] "GET /mid/a017bc62-0982-42db-8403-241d69da8d0f@alexander-goetzenstein.my-fqdn.de HTTP/2.0" 303 0 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.48 - - [05/Jun/2026:19:50:22 +0000] "GET /group/comp.os.linux.advocacy/a/a236f5a5-63a4-4982-8bb6-07ffc684201b@googlegroups.com HTTP/2.0" 200 34838 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.55 - - [05/Jun/2026:19:50:23 +0000] "GET /group/alt.recovery.aa/a/ne6onq%24hpp%241@dont-email.me HTTP/2.0" 200 5606 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.56 - - [05/Jun/2026:19:50:24 +0000] "GET /group/aioe.news.assistenza/a/qpukie%241i1g%241@neodome.net?view=headers HTTP/2.0" 200 17027 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.36 - - [05/Jun/2026:19:50:29 +0000] "GET /group/alt.obituaries/a/uf8pej%241hqi1%241@news.xmission.com HTTP/2.0" 200 6123 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.66 - - [05/Jun/2026:19:50:29 +0000] "GET /group/comp.theory/a/v3640k%24vg63%243@dont-email.me HTTP/2.0" 200 148720 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
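For anyone wanting to run the same check on their own logs, here is a minimal sketch that groups requests by /24 and user agent. It assumes access-log lines shaped like the ones above (IP, two dashes, bracketed timestamp, quoted request, status, size, quoted referer, quoted UA); the regex and the name `count_by_slash24` are made up for this example. It does not tell you which AS a prefix belongs to; that needs separate routing or whois data.

```python
import ipaddress
import re
from collections import Counter

# Matches lines in the format shown in the log excerpt above.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def count_by_slash24(lines):
    """Count requests per (IPv4 /24 network, user agent); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        try:
            addr = ipaddress.ip_address(match.group("ip"))
        except ValueError:
            continue
        if addr.version != 4:
            continue  # a /24 grouping only makes sense for IPv4
        network = ipaddress.ip_network(f"{addr}/24", strict=False)
        counts[(str(network), match.group("ua"))] += 1
    return counts
```

Calling `count_by_slash24(open("access.log"))` (file name hypothetical) and looking at `most_common(20)` should make a single /24 full of one crawler stand out quickly.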

And I assume you have this in your robots.txt?

  User-agent: meta-externalagent
  Disallow: /
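To confirm that a rule like this really does cover the crawler's UA string, Python's standard-library `urllib.robotparser` can evaluate it offline. This is only a sketch: the helper name `is_allowed` and the example.org URLs are placeholders, and the standard-library parser's matching may differ in edge cases from what any given crawler implements.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: meta-externalagent
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The crawler's UA is disallowed everywhere; an unrelated UA is not affected.
print(is_allowed(ROBOTS_TXT, "meta-externalagent/1.1", "https://example.org/group/"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.org/group/"))
```

If the first call returns False and the crawler still fetches the page, the rule is fine and the crawler is simply not honoring it.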

I have observed the same from Meta's crawler.

  User-agent: *
  Disallow: /
With that on e.g. our preproduction site, Meta is the only big-tech crawler that accesses it, at least with an honest user agent. (Meta also accesses disallowed paths on the production site.)
I'm not defending Meta here, but I should mention that Meta also uses crawlers to visit pages when someone sends a link through their services.

   User-agent: *
can be ignored by bots, but if they ignore the disallow rule for their own UA, they can easily be blocked by network AS.
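Blocking by network usually comes down to a prefix match once you have the AS's announced prefixes from somewhere. A minimal sketch, assuming you already have that list (the function name `in_blocked_prefixes` is hypothetical, and the only prefix used here is the /24 visible in the logs upthread, not a complete list for any AS):

```python
import ipaddress


def in_blocked_prefixes(ip: str, prefixes) -> bool:
    """Return True if ip falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in prefixes)


# Example list containing only the /24 seen in the log excerpt upthread.
BLOCKED = ["57.141.0.0/24"]
print(in_blocked_prefixes("57.141.0.42", BLOCKED))
```

In practice this check would live in the firewall or WAF rather than application code; the sketch just shows the matching logic.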
They don't obey *, so they don't get their own entry. I'd rather just poison their data; it's well-known behavior from them.

https://www.reddit.com/r/webdev/comments/1sdzd1q/metas_ai_cr...

Over at AppleInsider, the amount of bot traffic is insane, and it's gotten to the point where it's even starting to jack up some of our Google Analytics. No, GA, we didn't see a natural 320% rise in visitors from Singapore...

It's almost a full-time job to manage. Our WAF rules are 90% 'oh shit AI scraper bots found a new vector.' It doesn't border on DDoS -- it effectively is one. Coupled with all the Google changes that started a few years ago -- which is a separate topic I could rant about for ages -- 2026 is just a VERY bad time to be a website owner. I actually wiped most of our robots.txt rules the other day because literally nobody followed them. Anybody who tells you otherwise is flatly lying.

This is the new normal though, we gotta try and figure it out

Does LW have a downloadable archive? I can only find references to GreaterWrong but no public answer. Would be useful.
thank you for maintaining LessWrong