by ericholscher 530 days ago
This keeps happening -- we wrote about multiple AI bots that were hammering us over at Read the Docs for >10TB of traffic: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...

They really are trying to burn all their goodwill to the ground with this stuff.

5 comments

In the early 2000s I was working at a place that Google wanted to crawl so bad that they gave us a hotline number to call if their crawler was giving us problems.

We were told at that time that the "robots.txt" enforcement was the one thing they had that wasn't fully distributed; it's a devilishly difficult thing to implement.

It boggles my mind that people with the kind of budget that some of these people have are struggling to implement crawling right 20 years later though. It's nice those folks got a rebate.

One of the reasons people are testy today is that you pay by the GB w/ cloud providers; about 10 years ago I kicked out the sinosphere crawlers like Baidu because they were generating like 40% of the traffic on my site, crawling over and over again and not sending even a single referrer.
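
Kicking out a crawler by User-Agent can be sketched in a few lines. This is a minimal sketch, assuming a Python WSGI app; the blocklist contents and the names `is_blocked` and `block_crawlers` are made up for illustration, and a crawler that lies about its User-Agent will sail right past it.

```python
# Hypothetical blocklist: substrings matched case-insensitively against
# the User-Agent header. Adjust to whatever shows up in your own logs.
BLOCKED_AGENTS = ("baiduspider", "gptbot")


def is_blocked(user_agent):
    """Return True if the User-Agent contains a blocklisted substring."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENTS)


def block_crawlers(app):
    """Wrap a WSGI app so blocklisted crawlers get a 403 and a tiny body."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else reaches the real application untouched.
        return app(environ, start_response)
    return middleware
```

The point of answering with a ten-byte 403 instead of the page is that you stop paying per-GB egress for requests you never wanted to serve.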

I've found Googlebot has gotten a bit wonky lately. 10X the usual crawl rate and

- they don't respect the Crawl-Delay directive

- google search console reports 429s as 500s

https://developers.google.com/search/docs/crawling-indexing/...
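
For comparison, here is roughly what honoring robots.txt and Crawl-Delay looks like on the crawler side, using Python's standard `urllib.robotparser`. It's a sketch, not how Googlebot or anyone else actually does it: the rules, the bot name `ExampleBot`, and the helper names `make_parser` and `polite_fetch_plan` are all made up for illustration.

```python
import urllib.robotparser

# Made-up rules for illustration; a real crawler would fetch /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def make_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, agent, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for this user agent."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        # No Crawl-delay given for this agent: fall back to our own default.
        delay = default_delay
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    return allowed, delay
```

A crawler would then sleep for `delay` seconds between requests to the URLs in `allowed`, and back off further when it sees a 429.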

I have found Google severely declining in engineering quality. On January 8th 2025, they stopped accepting JCB credit cards and emailed customers that their payment info was invalid and that they would be suspended (search Twitter for examples in Japanese). It seems it was a bug, but customers who received the notification got no explanation, and opening a ticket resulted in it being closed immediately while being lied to (my only guess is they wanted to increase their metrics). How was this not quality checked in the first place? I guess Google has the policy of recording the chat transcript (where lies are recorded), but it means nothing when the company doesn't care. I don't like it, but AWS seems the next logical place to move business to. As far as I can tell, the support there is real.
Serious question - if robots.txt files are not being honored, is there a risk of a class action from tens of thousands of small sites against both the companies doing the crawling and individual directors/officers of these companies? Seems there would be some recourse if this is done at a large enough scale.
No. robots.txt is not in any way a legally binding contract, no one is obligated to care about it.
If I have a "no publicity" sign in my mailbox and you dump 500 lbs of flyers and magazines by my door every week for a month and cause me to lose money dealing with all the trash, I think I'd have a reasonable ground to sue even if there's no contract saying you need to respect my wish.

End of the day the claim is someone's action caused someone else undue financial burden in a way that is not easily prevented beforehand, so I wouldn't say it's a 100% clear case but I'm also not sure a judge wouldn't entertain it.

I don't think you can sue over what amounts to an implied gentleman's agreement that one side never even agreed to and win but if you do, let us know.
You can sue whenever anyone harms you
You can sue whenever.

The suit itself is the mechanism for determining whether the harm existed.

And yes, of course, this presents much opportunity for abuse.

I didn't say no one could sue, anyone can sue anyone for anything if they have the time and the money. I said I didn't think someone could sue over non-compliance with robots.txt and win.

If it were possible, someone would have done it by now. It hasn't happened because robots.txt has absolutely no legal weight whatsoever. It's entirely voluntary, which means it's perfectly legal not to volunteer.

But if you or anyone else wants to waste their time tilting at legal windmills, have fun ¯\_(ツ)_/¯.

You can sue over literally anything, the parent comment could sue you if they could demonstrate your reply damaged them in some way.
We need a way to apply a click-through "user agreement" to crawlers
Hey man, I wanted to say good job on read the docs. I use it for my Python project and find it an absolute pleasure to use. Write my stuff in restructured text. Make lots of pretty diagrams (lol), slowly making my docs easier to use. Good stuff.

Edit 1: I'm surprised by the bandwidth costs. I use hetzner and OVH and the bandwidth is free. Though you manage the bare metal server yourself. Would readthedocs ever consider switching to self-managed hosting to save costs on cloud hosting?

Did I read it right that you pay 62,5$/TB?
can you feed them gibberish?
here's a nice project to automate this: https://marcusb.org/hacks/quixotic.html

couple of lines in your nginx/apache config and off you go

my content rich sites provide this "high quality" data to the parasites

LLMs poisoned by https://git-man-page-generator.lokaltog.net/ -like content would be a hilarious end result, please do!
This would be my elegant solution, something like an endless recursion with a gzip bomb at the end if I can identify your crawler and it’s that abusive. Would it be possible to feed an abusing crawler nothing but my own locally-hosted LLM gibberish?

But then again if you’re in the cloud egress bandwidth is going to cost for playing this game.

Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non competitors alike. The winner will have the longest runway

It wouldn’t even necessarily need to be a real GZip bomb. Just something containing a few hundred kb of seemingly new and unique text that’s highly compressible and keeps providing “links” to additional dynamically generated gibberish that can be crawled. The idea is to serve a vast amount of poisoned training data as cheaply as possible. Heck, maybe you could even make a plugin for NGINX to recognize abusive AI bots and do this. If enough people install it then you could provide some very strong disincentives.
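
The idea above can be sketched in Python. This is only a sketch under loud assumptions: the word list, page size, link format, and the names `gibberish_page` and `compressed_size` are all invented here, and filler drawn from eight words is far from "seemingly new and unique" text. It just shows the two mechanics: every page links to more generated pages, and the output compresses well so it is cheap to serve.

```python
import gzip
import hashlib
import random

# Tiny made-up vocabulary; a small word list is what makes the page
# compress well. A real setup would want far more varied text.
WORDS = ("crawl", "docs", "index", "token", "model", "cache", "proxy", "shard")


def gibberish_page(path, n_words=2000, n_links=5):
    """Build a deterministic page of filler text plus links to more pages."""
    # Seed from the path so the same URL always yields the same page
    # and nothing has to be stored on disk.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    # Each page links to child paths that are themselves generated on request.
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    anchors = "\n".join(f'<a href="{link}">{link}</a>' for link in links)
    html = f"<html><body><p>{text}</p>\n{anchors}</body></html>"
    return html, links


def compressed_size(html):
    """Size in bytes of the gzip-compressed page, i.e. what egress would carry."""
    return len(gzip.compress(html.encode()))
```

Hooked up behind whatever check flags a bot as abusive, each request costs a hash and a few thousand random choices, and the crawler never runs out of links to follow.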
The dataset is curated, very likely with a previously trained model, so gibberish is not going to do anything.
how would a previously trained model know that Elon doesn't smoke old socks?
An easy way is to give the model the URL of the page so it can value the content based on the reputation of the source, of course the model doesn't know future events, but gibberish is gibberish, and that's quite easy to filter, even without knowing the source.
> gibberish is gibberish

most insightful, thank you! also, stay away from linkedin, you sweet summer child.

I don't understand why you are so aggressive ahah, gibberish is easy to recognize I'm sorry, you don't need to be mad about it ahah