| HN Mirror


	by Octokiddie 804 days ago
	I'm more interested in what that content farm is for. It looks pointless, but I suspect there's a bizarre economic incentive. There are affiliate links, but how much could that possibly bring in?

6 comments

throw_a_grenade 804 days ago

This is a honeypot. The author, https://en.wikipedia.org/wiki/John_R._Levine, keeps it just to notice any new (significant) scraping operation, which will invariably hit his little farm and show up in the logs. He's a well-known anti-spam operative whose various efforts date back multiple decades.

Notice how he casually drops a link to the landing page in the NANOG message. That's how the bots get the bait.

madkangas 804 days ago

I recognize the name John Levine at iecc.com, "Invincible Electric Calculator Company," from the web 1.0 era. He was the moderator of the Usenet comp.compilers newsgroup and wrote the first C compiler for the IBM RT PC.

https://compilers.iecc.com/

gtirloni 804 days ago

I'd say it's more like a honeypot for bots. So pretty similar objectives.

Octokiddie 804 days ago

So it served its purpose by trapping the OpenAI spider? If so, why post that message? As a flex?

Takennickname 804 days ago

It's a honeypot. He's telling people openai doesn't respect robots.txt and just scrapes whatever the hell it wants.

cwillu 804 days ago

Except the first thing openai does is read robots.txt.

However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of a robots.txt on the new domain.
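To illustrate the per-host point, here is a minimal Python sketch of how a crawler might work out which robots.txt applies to each link. The helper name `robots_url` and the second hostname are made up for illustration; only the first hostname appears in this thread.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL governing page_url (hypothetical helper).

    robots.txt is looked up per scheme + host, so every new subdomain
    means a separate robots.txt fetch.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


links = [
    "https://petra-cody-carlene.web.sp.am/",
    "https://alpha-beta-gamma.web.sp.am/some/page",  # hypothetical subdomain
    "https://alpha-beta-gamma.web.sp.am/other",      # same host as above
]

# One robots.txt per distinct host, not per page.
robots_needed = sorted({robots_url(u) for u in links})
```

With a fresh subdomain behind every link, the number of robots.txt fetches grows with the number of links followed, which would be consistent with robots.txt making up a large share of the requests.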

flutas 804 days ago

> Except the first thing openai does is read robots.txt.

Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.

niutech 803 days ago

This robots.txt has the Disallow rule commented out:

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

darkwater 804 days ago

And they do have (the same) robots.txt on every domain, tailored for GPTBot, e.g. https://petra-cody-carlene.web.sp.am/robots.txt

So, GPTBot is not following robots.txt, apparently.

AgentME 804 days ago

All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.

cwillu 804 days ago

Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”

And then all the links are to external domains, which aren't subject to the first site's robots.txt
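As a rough sketch of what "receiving the noindex header and/or meta tag" could look like on the crawler side, the Python below checks an `X-Robots-Tag` response header and a `<meta name="robots">` tag. The function and class names are hypothetical, and the header parsing is simplified (it ignores per-bot prefixes such as `googlebot: noindex`).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def is_noindex(headers: dict, html: str) -> bool:
    """True if the header or the page markup asks not to be indexed."""
    header_tokens = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_tokens.update(t.strip().lower() for t in value.split(","))
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_tokens or "noindex" in parser.directives
```

Note that both signals are only visible after the page has been requested, which is the point being made: a crawler cannot see them from robots.txt alone.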

fsckboy 804 days ago

humans don't read/respect robots.txt, so in order to pass the Turing test, ai's need to mimic human behavior.

queuebert 804 days ago

Did we just figure out a DoS attack for AGI training? How large can a robots.txt file be?

everforward 804 days ago

No, because there’s no legal weight behind robots.txt.

The second someone weaponizes robots.txt all the scrapers will just start ignoring it.

a_c 804 days ago

What about making it slow? One byte at a time for example while keeping the connection open

Takennickname 803 days ago

> Except the first thing openai does is read robots.txt.

What good is reading it if it doesn't respect it

GaggiX 804 days ago

It seems to respect it as the majority of the requests are for the robots.txt.

flutas 804 days ago

He says 3 million, and 1.8 million are for robots.txt

So 1.2 million non robots.txt requests, when his robots.txt file is configured as follows

    # buzz off
    User-agent: GPTBot
    Disallow: /

Theoretically if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.
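For what it's worth, the reading above can be checked with Python's standard `urllib.robotparser`. This is only a sketch against the three lines quoted in this comment (not necessarily what the site serves now), and `SomeOtherBot` is a made-up user agent.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as quoted in the comment above.
ROBOTS_TXT = """\
# buzz off
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://petra-cody-carlene.web.sp.am/"
gptbot_allowed = parser.can_fetch("GPTBot", url)       # rule applies to GPTBot
other_allowed = parser.can_fetch("SomeOtherBot", url)  # no rule for this agent

# The request counts quoted above: total minus robots.txt fetches.
total_requests = 3_000_000
robots_requests = 1_800_000
page_requests = total_requests - robots_requests
```

Under these rules `gptbot_allowed` is False for every path on the host, so none of the 1.2 million non-robots.txt requests would be permitted if the file really looked like this at crawl time.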

otherme123 804 days ago

A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow a path like "check if we have a robots.txt that allows us to scan this site; if we don't, fetch and store robots.txt, and scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better to ask for forgiveness than for permission".
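The "might take a day or two before it notices" part is easy to picture as a cache with a time-to-live. The Python sketch below is an assumption about how such a crawler could be built, not a description of any real one: `CachedRobots`, the `fetch` callable, the 24-hour TTL, and `example.com` are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser


class CachedRobots:
    """Caches parsed robots.txt per host and re-fetches only after a TTL."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.time):
        self._fetch = fetch      # hypothetical callable: host -> robots.txt text
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}         # host -> (fetched_at, parser)

    def allowed(self, agent, host, path="/"):
        now = self._clock()
        entry = self._cache.get(host)
        if entry is None or now - entry[0] > self._ttl:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            entry = (now, parser)
            self._cache[host] = entry
        return entry[1].can_fetch(agent, f"https://{host}{path}")


# Simulated site whose owner adds a Disallow rule after the first crawl.
site = {"robots": ""}
now = [0.0]
robots = CachedRobots(fetch=lambda host: site["robots"], clock=lambda: now[0])

before = robots.allowed("GPTBot", "example.com")  # empty file: allowed
site["robots"] = "User-agent: GPTBot\nDisallow: /"
stale = robots.allowed("GPTBot", "example.com")   # cached copy still used
now[0] += 2 * 24 * 3600                           # two days later
fresh = robots.allowed("GPTBot", "example.com")   # re-fetched: disallowed
```

In this sketch the crawler is "respecting" whatever copy it last fetched, so a rule change only takes effect once the cached entry expires.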

vertis 804 days ago

Except now it says

    # silly bing
    #User-agent: Amazonbot
    #Disallow: /

    # buzz off
    #User-agent: GPTBot
    #Disallow: /

    # Don't Allow everyone
    User-agent: *
    Disallow: /archive

    # slow down, dudes
    #Crawl-delay: 60

Which means he's changing it. The default for all other bots is to allow crawling.
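Feeding the file quoted above to Python's `urllib.robotparser` gives the same reading. This is a sketch against the text as pasted in this comment; the `/archive/page` path is just an example.

```python
from urllib.robotparser import RobotFileParser

# robots.txt as quoted in the comment above (commented-out rules included).
ROBOTS_TXT = """\
# silly bing
#User-agent: Amazonbot
#Disallow: /

# buzz off
#User-agent: GPTBot
#Disallow: /

# Don't Allow everyone
User-agent: *
Disallow: /archive

# slow down, dudes
#Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

host = "https://petra-cody-carlene.web.sp.am"
root_allowed = parser.can_fetch("GPTBot", host + "/")
archive_allowed = parser.can_fetch("GPTBot", host + "/archive/page")
delay = parser.crawl_delay("GPTBot")  # the Crawl-delay line is commented out
```

With the GPTBot block commented out, GPTBot falls under the `*` entry: everything except `/archive` is allowed, and no crawl delay is set.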

jeffnappi 804 days ago

His site has a subdomain for every page, and the crawler is considering those each to be unique sites.

swyx 804 days ago

for the 1.2 million are there other links he's not telling us about?

swatcoder 804 days ago

I'm not sure any publisher means for their robots.txt to be read as:

"You're disallowed, but go ahead and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."

mminer237 804 days ago

How would one know he is disallowed without reading each site?

dspillett 804 days ago

So, it has worked…

pflanze 803 days ago

Linkers & Loaders is their own book (I haven't checked the others).

They have a page at https://www.iecc.com/linker/ where they used to publish a draft of the book contents, but changed the page to say "Chapters were available in an excessive variety of formats, but are not any longer due to chronic piracy", when it got posted to HN at https://news.ycombinator.com/item?id=18424233 and I bundled the files for offline reading. I notified them via email about that asking if they are OK with it but got an unfriendly response that I pirated the files and that wasn't OK, so I took the link down again and they changed that text. (Shrug. I'm not a/the book author, they are. I'll say that I also suggested to them they ask on the page not to do what I did since then I wouldn't have, but they chose their more radical approach.)

agilob 804 days ago

It's for shits-and-giggles and it's doing its job really well right now. Not everything needs to have an economic purpose, 100 trackers, ads and backed by a company.

phyzome 804 days ago

And yet it has the Amazon links, which makes it appear to have some economic purpose...

phyzome 804 days ago

...oh right, it's probably so that every page has affiliate links, which should be a signal of low quality to a crawler.

schleck8 804 days ago

The books on there are affiliate links I think.