HN Mirror

by zargath 394 days ago

Sounds very basic, sadly.

Anybody know why these web crawling/bot standards are not evolving? I believe robots.txt was invented in 1994(thx chatgpt). People have tried with sitemaps, RSS and IndexNow, but it's like huge $$ organizations are depending on HelloWorld.bas tech to control their entire platform.

I want to spin up endpoints/mcp/etc. and let intelligent bots communicate with my services. Let them ask for access, ask for content, pay for content, etc. I want to offer solutions for bots to consume my content, instead of having to choose between full or no access.
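For context, the "full or no access" choice is roughly what robots.txt can express today: per-crawler allow or disallow by path, with no way to negotiate or pay. A minimal sketch using Python's standard `urllib.robotparser`; the bot names, rules, and URL are hypothetical and only for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is shut out entirely,
# every other crawler is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The only answers available are yes or no; there is nothing in between.
blocked = can_fetch("ExampleAIBot", "https://example.com/articles/1")
allowed = can_fetch("SomeOtherBot", "https://example.com/articles/1")
print(blocked, allowed)
```

Note that this is advisory only: a crawler has to choose to read and honor these rules.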

I am all for AI, but please try to do better. Right now the internet is about to be eaten up by stupid bot farms and served into chat screens. They don't want to refer back to their source, and when they do, it's with insane error rates.

3 comments

stereolambda 394 days ago

> I believe robots.txt was invented in 1994(thx chatgpt).

Not to pick on you, but I find it quicker to open a new tab and do "!w robots.txt" (for search engines supporting the bang notation) or "wiki robots.txt"<click> (for Google I guess). The answer is right there, no need to explain to an LLM what I want or verify [1].

[1] Ok, Wikipedia can be wrong, but at least it is a commonly accessible source of wrong I can point people to if they call me out. Plus my predictive model of Wikipedia wrongness gives me pretty low likelihood for something like this, while for ChatGPT it is more random.

reaperducer 394 days ago

> robots.txt was invented in 1994(thx chatgpt)

Thought of and discussed as a possibility in 1994.

Proposed as a standard in 2019.

Adopted as a standard in 2022.

Thanks, IETF.

Dylan16807 394 days ago

This phrasing is very misleading. To bullet point directly from "possibility" to "standard" implies the standardization was a turning point where it could start being used. But it was massively used long before that. The standard is a side note that's barely relevant.

reaperducer 393 days ago

It only became massively used in 2019, when Google recommended it.

Dylan16807 393 days ago

Where did you get that date?

https://serverfault.com/questions/171985/how-can-i-encourage...

Here's a 2010 discussion about Google's explicit support, and I'm sure I could find earlier.

The thing Google did in 2019 was submit it as a standard, which had nothing to do with adoption or with starting to recommend it. In that very post they said "For 25 years, the Robots Exclusion Protocol (REP) has been one of the most basic and critical components of the web" and "The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP."

reaperducer 393 days ago

> Where did you get that date?

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under Internet Engineering Task Force.[8]

https://en.wikipedia.org/wiki/Robots.txt

Dylan16807 393 days ago

That is not when they started recommending it. It would be nice if you acknowledged the rest of my comment; I even quoted from the [8] reference.

TechDebtDevin 394 days ago

This comment seems like it comes from a Cloudflare employee.

This is clearly the first step in cf building out a marketplace where they will attempt (and fail) to be the middleman in a useless market between crawlers and publishers.

zargath 394 days ago

nah, disappointed cf customer
