by gtirloni 804 days ago
I'd say it's more like a honeypot for bots. So pretty similar objectives.

So it served its purpose by trapping the OpenAI spider? If so, why post that message? As a flex?
It's a honeypot. He's telling people openai doesn't respect robots.txt and just scrapes whatever the hell it wants.
Except the first thing openai does is read robots.txt.

However, robots.txt doesn't cover multiple domains, and every link that's being crawled points to a new domain, which requires a fresh read of the robots.txt on that new domain.
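To make the per-domain point concrete, here is a minimal sketch of how a crawler might work out which robots.txt governs a given link. The function name `robots_url` and the second hostname are made up for illustration; only the `petra-cody-carlene.web.sp.am` host appears in this thread.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url.

    robots.txt is looked up per scheme and host, so two subdomains
    never share one file, even if the contents happen to be identical.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# The first host is from the thread; the second is a hypothetical sibling subdomain.
first = robots_url("https://petra-cody-carlene.web.sp.am/some/page")
second = robots_url("https://another-made-up-name.web.sp.am/")

print(first)            # https://petra-cody-carlene.web.sp.am/robots.txt
print(first == second)  # False: each subdomain needs its own fetch
```

Under this assumption, every new subdomain costs the crawler one more robots.txt request before it can decide anything about that host.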

> Except the first thing openai does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.

This robots.txt has the Disallow rule commented out:

    # buzz off
    #User-agent: GPTBot
    #Disallow: /
And they do have (the same) robots.txt on every domain, tailored for GPTBot, e.g. https://petra-cody-carlene.web.sp.am/robots.txt

So, GPTBot is not following robots.txt, apparently.

All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.
It wasn't commented out a few hours ago when I checked it. I think that's a recent change.
Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt

This is a moderately persuasive argument.

Although the crawler should probably ignore the entire HTML body. But it does feel like a grey area if I accept your first point.

Humans don't read/respect robots.txt, so in order to pass the Turing test, AIs need to mimic human behavior.
This must be why self-driving cars always ignore the speed limit. ;)
Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?
No, because there’s no legal weight behind robots.txt.

The second someone weaponizes robots.txt all the scrapers will just start ignoring it.

That’s how you weaponize it. Set things up to give endless/randomized/poisoned data to anybody that ignores robots.txt.
What about making it slow? One byte at a time for example while keeping the connection open
That would make it a tarpit, a very old technique to combat scrapers/scanners
A slow stream that never ends?
Sounds like endlessh
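For anyone wondering what "one byte at a time" looks like, here is a rough sketch of the drip-feed part of a tarpit. It is only the response generator, not a full server, and it is not how endlessh itself is implemented. The name `drip` and its parameters are invented for this example.

```python
import itertools
import time
from typing import Callable, Iterator


def drip(
    payload: bytes,
    delay_seconds: float = 1.0,
    endless: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield payload one byte at a time, pausing before each byte.

    With endless=True the payload repeats forever, so a client that
    keeps reading never reaches the end of the response.
    """
    source = itertools.cycle(payload) if endless else iter(payload)
    for byte in source:
        sleep(delay_seconds)  # the slowness is the whole point
        yield bytes([byte])


# Finite drip with no delay, just to show the chunking.
print(list(drip(b"abc", delay_seconds=0)))  # [b'a', b'b', b'c']

# Endless drip: take only the first five chunks so the demo terminates.
print(list(itertools.islice(drip(b"ab", delay_seconds=0, endless=True), 5)))
```

A generator like this could be handed to any server framework that streams iterables as a response body. Each open connection still costs the server something, so the delay and connection limits would need tuning.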
> Except the first thing openai does is read robots.txt.

What good is reading it if it doesn't respect it?

It seems to respect it as the majority of the requests are for the robots.txt.
He says 3 million, and 1.8 million are for robots.txt

So 1.2 million non robots.txt requests, when his robots.txt file is configured as follows

    # buzz off
    User-agent: GPTBot
    Disallow: /
Theoretically if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.
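That reading of the rules can be checked with Python's standard-library `urllib.robotparser`. This is a sketch only: `example.com` stands in for the real host, and an actual crawler may use a different parser with slightly different matching rules.

```python
from urllib.robotparser import RobotFileParser

# The three lines quoted above, as they reportedly looked earlier.
ROBOTS_TXT = """\
# buzz off
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder host, not the site from the thread.
print(parser.can_fetch("GPTBot", "https://example.com/"))          # False
print(parser.can_fetch("GPTBot", "https://example.com/any/page"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/"))    # True
```

With `Disallow: /` in its group, GPTBot is refused even the homepage, while a bot with no matching group is allowed by default.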
A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow a path like "check whether we have a stored robots.txt that allows us to scan this site; if we don't, fetch and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better to ask for forgiveness than for permission".
That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.
Except now it says

    # silly bing
    #User-agent: Amazonbot
    #Disallow: /

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

    # Don't Allow everyone
    User-agent: *
    Disallow: /archive

    # slow down, dudes
    #Crawl-delay: 60
Which means he's changing it. The default for all other bots is to allow crawling.
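Feeding the edited file to Python's standard-library `urllib.robotparser` shows the same thing. This is a sketch: `example.com` is a placeholder host, and other parsers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The file as quoted above: every bot-specific group is commented out.
CURRENT_ROBOTS_TXT = """\
# silly bing
#User-agent: Amazonbot
#Disallow: /

# buzz off
#User-agent: GPTBot
#Disallow: /

# Don't Allow everyone
User-agent: *
Disallow: /archive

# slow down, dudes
#Crawl-delay: 60
"""

current = RobotFileParser()
current.parse(CURRENT_ROBOTS_TXT.splitlines())

# Commented-out lines are ignored, so GPTBot falls under the "*" group.
print(current.can_fetch("GPTBot", "https://example.com/"))           # True
print(current.can_fetch("GPTBot", "https://example.com/archive/x"))  # False
print(current.crawl_delay("GPTBot"))                                 # None
```

Only `/archive` is off limits now, and the commented-out `Crawl-delay` has no effect.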
His site has a subdomain for every page, and the crawler is considering those each to be unique sites.
There are fewer than 10 links on each domain, so how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt".
Of course it’s considering them as unique sites. They are unique sites.
For the 1.2 million, are there other links he's not telling us about?
I'm assuming those are homepage requests for the subdomains.
I'm not sure any publisher means for their robots.txt to be read as:

"You're disallowed, but go ahead and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."

How would one know he is disallowed without reading each site?
The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.

In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.

So, it has worked…