| HN Mirror


by jsheard 698 days ago

> I'm glad that most of them seem to respect robots.txt.

https://github.com/ai-robots-txt/ai.robots.txt/blob/main/tab...

Some of them identify themselves by user agent but don't respect robots.txt, so you have to set up your server to return a 403 for their requests to keep them out. If they start obfuscating their user agents, there won't be an easy solution besides deferring to a platform like Cloudflare, which offers to play that cat-and-mouse game on your behalf.
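As a rough illustration of the "403 their requests" approach, here is a minimal Python sketch of the decision a server or middleware would make. The bot names in `BLOCKED_AGENT_TOKENS` and the function `status_for_user_agent` are hypothetical placeholders, not taken from the linked list; a real deployment would more likely express this in the web server's own configuration.

```python
# Hypothetical blocklist: substrings to look for in the User-Agent header.
# These names are placeholders, not real crawlers.
BLOCKED_AGENT_TOKENS = ("ExampleAIBot", "OtherScraper")


def status_for_user_agent(user_agent: str) -> int:
    """Return 403 if the User-Agent contains a blocked token, else 200."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS):
        return 403
    return 200
```

This only works while the crawler sends an honest User-Agent header, which is exactly the limitation described above.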


lolinder 698 days ago

The entry here for Perplexity is the one that got a lot of attention but it's also unfair: PerplexityBot is their crawler, which uses that user agent and as far as anyone can tell it respects robots.txt.

They also have a feature that will, if a user pastes a URL into their chat, go fetch the data and do something with it in response to the user's query. This is the feature that made a big kerfuffle on HN a while back when someone noticed it [0].

That second feature is not a web crawler in any meaningful sense of the word "crawler". It looks up exactly one URL that the user asked for and does something with it. It's Perplexity acting as a User Agent in the original sense of the word: a user's agent for accessing and manipulating data on the open web.

If an AI agent manipulating a web page that I ask it to manipulate in the way I ask it to manipulate it is considered abusive then so are ad blockers, reader mode, screen readers, dark reader, and anything else that gives me access to open web content in a form that the author didn't originally intend.

[0] https://news.ycombinator.com/item?id=40690898


ffsm8 698 days ago

No, that's illogical.

The action is indeed prompted by a human, but so is any crawl in some way. At some point they either configured an interval or other trigger to send the script to the Web host to fetch anything it can find.

It's inherently different to extensions such as adblockers that just remove elements according to configuration.

After all, the user's device will never even see the final DOM now. Instead, it's getting fetched, parsed, and processed on a third device, which is objectively a robot. You'd be able to make that argument only if it was implemented via an extension (the user's device fetches the page and posts the final document to the LLM for processing).

And that's ignoring the fact that adblockers are seen as illegitimate by a lot of websites too, and those sites often try to block access for people using these extensions.


lolinder 698 days ago

I wrote a reply but you edited out the chunk of text that I quoted, so here's a new reply.

> After all, the user's device will never even see the final DOM now. Instead, it's getting fetched, parsed, and processed on a third device, which is objectively a robot.

Sure, but why does it matter if the machine that I ask to fetch, parse, and process the DOM lives on my computer or on someone else's? I, the human being, will never see the DOM either way.

This distinction between my computer and a third-party computer quickly falls apart when you push at it.

If I issue a curl request from a server that I'm renting, is that a robot request? What about if I'm using Firefox on a remote desktop? What about if I self-host a client like Perplexity on a local server?

We live in an era where many developers run their IDE backend in the cloud. The line between "my device" and "cloud device" has been nearly entirely blurred, so making that the line between "robot" and "not robot" is entirely irrational in 2024.

The only definition of "robot" or "crawler" that makes any kind of sense is the one provided by robotstxt.org [0], and it's one that unequivocally would incorporate Perplexity on the "not robot" side:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. ... Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Or the MDN definition [1]:

> A web crawler is a program, often called a bot or robot, which systematically browses the Web to collect data from webpages. Typically search engines (e.g. Google, Bing, etc.) use crawlers to build indexes.

Perplexity issues one web request per human interaction and does not fetch referenced pages. It cannot be considered a "crawler" by either of these definitions, and the definition you've come up with just doesn't work in the era of cloud software.

[0] https://www.robotstxt.org/faq/what.html

[1] https://developer.mozilla.org/en-US/docs/Glossary/Crawler
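To make the distinction in the robotstxt.org definition concrete, here is a small Python sketch contrasting a single user-requested fetch with a recursive traversal. Everything in it is illustrative: `fetch_single`, `crawl`, and the in-memory link graph are hypothetical stand-ins, and nothing here describes how Perplexity is actually implemented.

```python
from typing import Callable, List


def fetch_single(url: str, get_links: Callable[[str], List[str]]) -> List[str]:
    """Retrieve exactly one user-supplied URL; referenced pages are not followed."""
    get_links(url)  # the page is fetched, but its links are ignored
    return [url]


def crawl(start: str, get_links: Callable[[str], List[str]]) -> List[str]:
    """Retrieve a document, then recursively retrieve every document it references."""
    seen: List[str] = []
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        queue.extend(get_links(url))  # follow every referenced document
    return seen


# Hypothetical in-memory "web": each URL maps to the URLs it links to.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": []}
single_result = fetch_single("a", GRAPH.__getitem__)
crawl_result = crawl("a", GRAPH.__getitem__)
```

With this toy graph, `fetch_single` touches only `a`, while `crawl` ends up visiting `a`, `b`, and `c`.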


ffsm8 697 days ago

I'm honestly confused here. If anything, aren't your quotes literally confirming my point?

It's triggering an automation which fetches data. This is a crawl, even if the crawl has a very limited scope. (It's also not limited to a single request; that's just the scope that's used by default.) But even if it were programmatically limited to only ever request a single resource, that'd still be a crawl. While recursion is the norm to build indexes, it's not necessary for all use cases that utilize crawlers.

Did you ever actually make anything that's utilizing them to gather information that you want? You might be surprised to know that ad hoc triggering of a single resource fetch is actually pretty common to keep data up-to-date.

> If I issue a curl request from a server that I'm renting, is that a robot request? What about if I'm using Firefox on a remote desktop? What about if I self-host a client like Perplexity on a local server?

Yes, anything on a third device is effectively a robot that's acting on behalf of the actor.


wilg 698 days ago

If I were making a search engine or AI crawler, I would simply pose as Googlebot.


jsheard 698 days ago

Google actually provides means of validating whether a request really came from them, so masquerading as Googlebot would probably backfire on you. I would expect the big CDNs to flag your IP address as malicious if you fail that check.

https://developers.google.com/search/docs/crawling-indexing/...
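For readers wondering what such a validation can look like, here is a hedged Python sketch of a reverse-then-forward DNS check. The domain suffixes in `GOOGLE_CRAWLER_DOMAINS` are my assumption about Google's published guidance and should be confirmed against the linked documentation; the function names and the injectable resolvers are hypothetical, added so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for Google crawlers; verify against Google's docs.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".googleusercontent.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> set:
    """Return the set of IP addresses a hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_verified_googlebot(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    try:
        host = reverse(ip)
        if not host.endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # The forward lookup must map back to the original IP; otherwise the
        # PTR record could simply have been set by whoever controls the address.
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

A request that claims to be Googlebot in its User-Agent but fails this check is the case that would "backfire" as described above.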


wilg 698 days ago

You could maybe still only follow robots.txt rules for Googlebot.
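A small Python sketch of what "only follow robots.txt rules for Googlebot" would mean in practice, using the standard library's `urllib.robotparser`. The `ROBOTS_TXT` content, the `ExampleAIBot` name, and the `allowed` helper are made up for illustration: a site that welcomes Googlebot but disallows everyone else gives very different answers depending on which agent's rules a crawler chooses to read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is welcome, every other agent is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Ask which answer robots.txt gives for a particular agent name."""
    return parser.can_fetch(agent, url)


# A crawler that evaluates the Googlebot group is told it may fetch the page...
as_googlebot = allowed("Googlebot", "https://example.com/page")
# ...while the same crawler under its own (hypothetical) name is told it may not.
as_itself = allowed("ExampleAIBot", "https://example.com/page")
```

Note that this only changes which rules the crawler reads; it does nothing to pass a server-side identity check on the request itself.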
