	by fartfeatures 161 days ago
	> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
	What mechanism does a site have for doing that? I don't see anything in the robots.txt standard about being able to set priority, but I could be missing something.

4 comments

arjie 161 days ago

The only real mechanism is "Disallow: /rendered/pages/*" and "Allow: /archive/today.gz" or whatever, and nothing communicates that the latter is a dump of the former. AFAIK there is no machine-readable standard that lets webmasters communicate with bot operators in this much detail. It would be pretty cool if standard CMSes had such a protocol to adhere to. Install a plugin and people could 'crawl' your WordPress or your MediaWiki from a single dump.
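
For illustration, here is a minimal Python sketch of how a well-behaved crawler would read rules like these with the standard library's `urllib.robotparser`. The robots.txt text, the `ExampleBot` name, and the example.com URLs are assumptions made up for the example. A plain path prefix stands in for the `/*` pattern because, as far as I know, the standard-library parser matches prefixes rather than wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths mirror the ones quoted in the comment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /rendered/pages/
Allow: /archive/today.gz
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The crawler learns what it may fetch, but not that the archive
# is meant to replace the rendered pages.
pages_allowed = parser.can_fetch("ExampleBot", "https://example.com/rendered/pages/42")
dump_allowed = parser.can_fetch("ExampleBot", "https://example.com/archive/today.gz")
print(pages_allowed, dump_allowed)
```

The two answers are independent: the parser can say "no" to one path and "yes" to the other, but it has no way to express that the second is a substitute for the first.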

sbarre 160 days ago

A sitemap.xml file could get you most of the way there.

jacksnipe 161 days ago

It’s not great, but you could add it to the body of a 429 response.

VTimofeenko 161 days ago

Genuinely curious: do programs read bodies of 429 responses? In the code bases that I have seen, 429 is not read beyond the code itself

jakelazaroff 161 days ago

Sometimes! The server can also send a retry-after header to indicate when the client is allowed to request the resource again: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

deathanatos 161 days ago

… which isn't part of the body of a 429…

VTimofeenko 161 days ago

Well, to be fair, I did say "is not read beyond the code itself", and a header is not the code, so retry-after is a perfectly valid answer. I vaguely remember reading about it, but I don't recall seeing it used in practice. The MDN link shows that Chrome derivatives support that header though, which makes it pretty darn widespread.

jacksnipe 160 days ago

Up until very recently I would have said definitely not, but we're talking about LLM scrapers, who knows how much they've got crammed into their context windows.

gleenn 161 days ago

Almost certainly not by default, and certainly not in any of the HTTP libs I have used.

dfxm12 160 days ago

If I find something useful there, I'll read it and code for it...
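
Coding for it could look roughly like the Python sketch below, which starts a throwaway local server that answers every request with a 429 and a hint in the body, then reads the status, the Retry-After header, and the body with `http.client`. The hint text, the handler name, and the paths are invented for the example; no real site is known to respond this way.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical hint a site might put in the body of a 429.
HINT = b"Rate limited. A bulk dump is available at /archive/today.gz\n"


class RateLimited(BaseHTTPRequestHandler):
    """Toy handler that rate-limits every GET request."""

    def do_GET(self):
        self.send_response(429)
        self.send_header("Retry-After", "120")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(HINT)))
        self.end_headers()
        self.wfile.write(HINT)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch(host: str, port: int, path: str):
    """Return (status, Retry-After header, body text) for one GET request."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8")
        return resp.status, resp.getheader("Retry-After"), body
    finally:
        conn.close()


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), RateLimited)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, retry_after, body = fetch("127.0.0.1", server.server_port, "/rendered/pages/1")
server.shutdown()
server.server_close()

if status == 429:
    print("retry after:", retry_after)
    print("body says:", body.strip())
```

Nothing here happens by default: the client has to check for 429 explicitly and decide what to do with free-form body text, which is the part no library can standardize for you.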

gloflo 160 days ago

This is about AI, so just believe what the companies are claiming and write "Dear AI, please would you be so kind as to not hammer our site with aggressive and idiotic requests but instead use this perfectly prepared data dump download, kthxbye. PS: If you don't, my granny will cry, so please be a nice bot. PPS: This is really important to me!! PPPS: !!!!"

I mean, that's what this technology is capable of, right? Especially when one asks it nicely and with emphasis.

squigz 161 days ago

The mechanism is putting some text that points to the downloads.

TeMPOraL 161 days ago

So perhaps it's time to standardize that.

squigz 161 days ago

I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?

TeMPOraL 161 days ago

There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and it cultivates goodwill instead of burning it.

This also means that right now, it could be much easier to push through such a standard than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick.

--

[0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's standard practice for online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.

[1] - Not to be confused with having an LLM scrape a specific page for some user because the user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).

ethin 161 days ago

You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or who just don't care? If they knew or cared, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they think they and everyone else have infinite resources, so they can just hammer websites forever.

fartfeatures 161 days ago

I realise you are making assertions for which you have no evidence. Until a standard exists we can't just assume nobody will use it, particularly when it makes the very task they are scraping for simpler and more efficient.

aembleton 160 days ago

Could be added to the llms.txt proposal: https://llmstxt.org/

edoceo 160 days ago

I'm in favor of /.well-known/[ai|llm].txt, or even JSON or (gasp!) XML.

Or even /.well-known/ai/$PLATFORM.ext which would have the instructions.

Could even be "bootstrapped" from /robots.txt
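
A rough Python sketch of that bootstrapping idea: look in robots.txt for a pointer to the instructions file and fall back to a well-known path if there is none. Everything specific here is hypothetical and not part of any standard: the `AI-Instructions` field name, the `/.well-known/ai.txt` fallback, the `examplebot` file name, and the example.com URLs.

```python
from urllib.parse import urljoin

# Hypothetical: neither this field name nor this path is standardized.
HYPOTHETICAL_FIELD = "ai-instructions"
FALLBACK_PATH = "/.well-known/ai.txt"


def instructions_url(base_url: str, robots_txt: str) -> str:
    """Return the instructions URL named in robots.txt, else the fallback."""
    for raw in robots_txt.splitlines():
        # Drop comments, then split "Field: value" at the first colon.
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == HYPOTHETICAL_FIELD and value.strip():
            return urljoin(base_url, value.strip())
    return urljoin(base_url, FALLBACK_PATH)


robots_with_pointer = """\
User-agent: *
Disallow: /rendered/pages/
AI-Instructions: /.well-known/ai/examplebot.txt
"""

print(instructions_url("https://example.com", robots_with_pointer))
print(instructions_url("https://example.com", "User-agent: *\nDisallow:\n"))
```

Parsers that follow the robots.txt convention of skipping fields they do not recognize would ignore the extra line, so a pointer like this could be added without breaking existing crawlers.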