| HN Mirror



	by Takennickname 804 days ago
	It's a honeypot. He's telling people openai doesn't respect robots.txt and just scrapes whatever the hell it wants.

2 comments

cwillu 804 days ago

Except the first thing openai does is read robots.txt.

However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of robots.txt on the new domain.
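
A rough sketch of that per-host lookup, assuming the usual convention that robots.txt lives at the root of each scheme and host. The `robots_url` helper and the `another-name.web.sp.am` host are made up for illustration; the other hostname is the one quoted elsewhere in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


links = [
    "https://petra-cody-carlene.web.sp.am/",
    "https://petra-cody-carlene.web.sp.am/some/page",
    "https://another-name.web.sp.am/",  # hypothetical second subdomain
]

# Two pages on the same host share one robots.txt; a new host needs a new fetch.
robots_files = sorted({robots_url(link) for link in links})
for robots_file in robots_files:
    print(robots_file)
```

Three links on two hosts collapse to two robots.txt fetches, which is why a site with one subdomain per page generates about one robots.txt request per page.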


flutas 804 days ago

> Except the first thing openai does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.


niutech 803 days ago

This robots.txt has the Disallow rule commented out:

    # buzz off
    #User-agent: GPTBot
    #Disallow: /


darkwater 804 days ago

And they do have (the same) robots.txt on every domain, tailored for GPTBot, e.g. https://petra-cody-carlene.web.sp.am/robots.txt

So, GPTBot is not following robots.txt, apparently.


AgentME 804 days ago

All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.


Pannoniae 804 days ago

It wasn't commented out a few hours ago when I checked it. I think that's a recent change.


cwillu 804 days ago

Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt


andybak 804 days ago

This is a moderately persuasive argument.

Although the crawler should probably ignore all the html body. But it does feel like a grey area if I accept your first pint.


kubanczyk 803 days ago

You've been able to convince me to accept his second pint. Friday it is.


fsckboy 804 days ago

Humans don't read/respect robots.txt, so in order to pass the Turing test, AIs need to mimic human behavior.


gunapologist99 804 days ago

This must be why self-driving cars always ignore the speed limit. ;)


microtherion 804 days ago

More directly, e.g. Tesla boasts of training their FSD on data captured from their customers' unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.


roughly 804 days ago

Jesus, that’s one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone whose work includes humans), and probably should hire two, so the second one can tell you why the first died of an aneurysm after you explained your idea.


yreg 804 days ago

AI DRIVR claims that beta V12 is much better precisely because it takes rules less literally and drives more naturally.


queuebert 804 days ago

Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?


everforward 804 days ago

No, because there’s no legal weight behind robots.txt.

The second someone weaponizes robots.txt all the scrapers will just start ignoring it.


Retric 804 days ago

That’s how you weaponize it. Set things up to give endless/randomized/poisoned data to anybody that ignores robots.txt.


everforward 803 days ago

You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.

What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.

That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).

Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.


Retric 803 days ago

It’s fairly trivial to treat Google’s crawler differently if you want. https://developers.google.com/search/docs/crawling-indexing/...

The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don’t hand it over for free.

People don’t mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.


a_c 804 days ago

What about making it slow? One byte at a time for example while keeping the connection open


bityard 804 days ago

That would make it a tarpit, a very old technique to combat scrapers/scanners
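
A minimal sketch of the one-byte-at-a-time idea, assuming a WSGI server that streams the response iterable instead of buffering it. The names `drip` and `tarpit_app`, the one-second delay, and the payload are all illustrative choices, not a hardened tarpit.

```python
import time
from typing import Iterator

# Illustrative payload; any body would do.
PAYLOAD = b"User-agent: *\nDisallow: /\n"


def drip(payload: bytes, delay_seconds: float = 1.0) -> Iterator[bytes]:
    """Yield payload one byte at a time, pausing after each byte."""
    for i in range(len(payload)):
        yield payload[i:i + 1]
        time.sleep(delay_seconds)


def tarpit_app(environ, start_response):
    """WSGI app that answers every request with the slow drip."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return drip(PAYLOAD)
```

Each connection ties up a worker on the server side too, so in practice this kind of thing tends to be done with an event loop or at the proxy layer.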


happymellon 804 days ago

A slow stream that never ends?


SteveNuts 804 days ago

This would be considered a Slow Loris attack, and I'm actually curious how scrapers would handle it.

I'm sure the big players like Google would deal with it gracefully.


gtirloni 804 days ago

Here you go (1 req/min, 10 bytes/sec), please report results :)

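  http {
    # limit_req is not allowed inside "if", so select the client with a map:
    # the key is empty (and therefore not rate-limited) unless the
    # User-Agent is exactly "mimo".
    map $http_user_agent $slow_bot_key {
      default "";
      "mimo"  $binary_remote_addr;
    }
    limit_req_zone $slow_bot_key zone=ten_bytes_per_second:10m rate=1r/m;
    server {
      location / {
        # 1 request per minute for matching clients, queueing up to 5.
        limit_req zone=ten_bytes_per_second burst=5;
        if ($http_user_agent = "mimo") {
          # Throttle the response body to 10 bytes per second.
          limit_rate 10;
        }
      }
    }
  }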


throw_a_grenade 804 days ago

You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.
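
As a sketch of what "limits on everything" looks like at one layer, here is a bounded read over any file-like stream, with a byte cap and a time budget. `read_bounded` and `LimitExceeded` are hypothetical names, and the specific limits are arbitrary.

```python
import io
import time
from typing import BinaryIO


class LimitExceeded(Exception):
    """Raised when a response exceeds its size or time budget."""


def read_bounded(stream: BinaryIO, max_bytes: int, max_seconds: float,
                 chunk_size: int = 4096) -> bytes:
    """Read stream to EOF, giving up once either budget is exceeded."""
    deadline = time.monotonic() + max_seconds
    received = bytearray()
    while True:
        if time.monotonic() > deadline:
            raise LimitExceeded("time budget exhausted")
        # Ask for at most one byte past the cap, so an oversized body is detected.
        chunk = stream.read(min(chunk_size, max_bytes - len(received) + 1))
        if not chunk:
            return bytes(received)
        received += chunk
        if len(received) > max_bytes:
            raise LimitExceeded("size budget exhausted")


body = read_bounded(io.BytesIO(b"x" * 10), max_bytes=100, max_seconds=5)
print(len(body))
```

The deadline is only checked between reads, so a single blocking `read` on a socket can still hang unless the socket has its own timeout. That is the layering problem in miniature: the limit has to exist at every level.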


Phelinofist 804 days ago

Sounds like endlessh


Takennickname 803 days ago

> Except the first thing openai does is read robots.txt.

What good is reading it if it doesn't respect it


GaggiX 804 days ago

It seems to respect it as the majority of the requests are for the robots.txt.


flutas 804 days ago

He says 3 million, and 1.8 million are for robots.txt

So 1.2 million non robots.txt requests, when his robots.txt file is configured as follows

    # buzz off
    User-agent: GPTBot
    Disallow: /

Theoretically if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.
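
For what it's worth, this is how a standard parser reads those three lines. A small check using Python's `urllib.robotparser`; `example.com` and `SomeOtherBot` are placeholders, since only the path and the user agent matter for the match.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above.
ROBOTS_TXT = """\
# buzz off
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder host: the parser matches on the path part of the URL.
gptbot_home = parser.can_fetch("GPTBot", "https://example.com/")
gptbot_page = parser.can_fetch("GPTBot", "https://example.com/any/page")
other_home = parser.can_fetch("SomeOtherBot", "https://example.com/")

print(gptbot_home, gptbot_page, other_home)
```

Under these rules GPTBot is refused everything, including the homepage, while a bot that is not named is allowed.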


otherme123 804 days ago

A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check if we have a robots.txt that allows us to scan this site; if we don't, get and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better to ask for forgiveness than for permission".


jeremyjh 804 days ago

That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.


otherme123 804 days ago

I agree with you. I only stated how the crawlers seem to work: if you read their pages or try to block them or slow them down, it seems clear that they scan-first-respect-after. But somehow people understood that I approve of that behaviour.

For those bad crawlers, which I very much disapprove of, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and after that parse and respect robots.txt". Which I disapprove of and don't condone.


vertis 804 days ago

Except now it says

    # silly bing
    #User-agent: Amazonbot
    #Disallow: /

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

    # Don't Allow everyone
    User-agent: *
    Disallow: /archive

    # slow down, dudes
    #Crawl-delay: 60

Which means he's changing it. The default for all other bots is to allow crawling.
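
Feeding that version to Python's `urllib.robotparser` gives the same reading. A quick sketch; `example.com` and the `/archive/2024` path are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The file as quoted above; the commented-out lines are ignored by the parser.
ROBOTS_TXT = """\
# silly bing
#User-agent: Amazonbot
#Disallow: /

# buzz off
#User-agent: GPTBot
#Disallow: /

# Don't Allow everyone
User-agent: *
Disallow: /archive

# slow down, dudes
#Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

home_allowed = parser.can_fetch("GPTBot", "https://example.com/")
archive_allowed = parser.can_fetch("GPTBot", "https://example.com/archive/2024")
crawl_delay = parser.crawl_delay("GPTBot")

print(home_allowed, archive_allowed, crawl_delay)
```

With this file GPTBot falls under the `*` group: the homepage is allowed, anything under `/archive` is not, and no crawl delay applies.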


jeffnappi 804 days ago

His site has a subdomain for every page, and the crawler is considering those each to be unique sites.


sangnoir 804 days ago

There are fewer than 10 links on each domain, so how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "Disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt".


jameshart 804 days ago

Of course it’s considering them as unique sites. They are unique sites.


swyx 804 days ago

for the 1.2 million are there other links he's not telling us about?


flutas 804 days ago

I'm assuming those are homepage requests for the subdomains.


swatcoder 804 days ago

I'm not sure any publisher means for their robots.txt to be read as:

"You're disallowed, but go ahead and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."


mminer237 804 days ago

How would one know he is disallowed without reading each site?


swatcoder 804 days ago

The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.

In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.
