

by CaptainOfCoit 742 days ago

> does your bot respect robots.txt directives?

Would be a bit strange if it did, as the service is not a crawler/robot by any measure.

Bit like asking if cURL is "respecting" robots.txt.

It's just another user-agent after all.


dumbfounder 742 days ago

It is a service that seems to crawl a website for content and feed that content into some LLMs. It should absolutely respect robots.txt. This is exactly what robots.txt is used for, to tell automated crawlers of a website what they should and should not do.
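For reference, a minimal sketch of what such a check could look like, using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user-agent name, and the example.com URLs are made up for illustration; nothing here is taken from the service being discussed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real client would download it from
# the site's /robots.txt before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these assumed rules, `is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page.html")` is refused, while a URL outside `/private/` is permitted.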


IncreasePosts 742 days ago

I disagree - this is not a crawler that just blindly stumbles around any random website that it finds. It is more akin to a user agent. The only requests it makes are derived from specific instructions by the user to do so.

Having said that, people may use it as a crawler, just like you might be able to script Firefox to be a crawler, but it is not in itself a crawler.


dumbfounder 741 days ago

It doesn't need to be blindly stumbling around the web. But you might be right about it only grabbing one page, and if you are, then I agree that not abiding by robots.txt is only going to upset a tiny minority. When they talk about websites, it makes me think they are crawling to see all the pages linked from the homepage, because the question-asking part is extremely limited if all it does is look at one page. If they crawl, then I think they need to abide. If they don't, I think it's OK.


CaptainOfCoit 742 days ago

> It is a service that seems to crawl a website for content and feed that content into some LLMs.

It doesn't seem to work like that at all, to me.

As far as I understand, you give it a specific URL, and it extracts content from that URL and that URL only. A "crawl" would mean it would also follow links automatically, which I don't see any evidence of being done, from the landing page at least.
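To make the distinction concrete, here is a rough sketch of the two behaviours. Everything in it is hypothetical: `fetch_one`, `crawl`, `LinkCollector`, and the `fetch` callable are illustrative names, not how the service is known to work.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_one(url, fetch):
    """User-agent style: retrieve exactly the URL that was asked for."""
    return {url: fetch(url)}


def crawl(start_url, fetch, max_pages=10):
    """Crawler style: also follow the links discovered on each page."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        # Resolve relative links against the page they were found on.
        queue.extend(urljoin(url, href) for href in collector.links)
    return pages
```

Here `fetch` is any callable that maps a URL to its HTML. `fetch_one` never touches a second page, whereas `crawl` keeps following discovered links until it runs out or hits `max_pages`.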
