by pests 457 days ago
I remember in the late '90s/early 2000s, as a teen, going to robots.txt specifically to see what they were trying to hide and exploring those URLs.

What is the difference if I use a browser or an LLM tool (or curl, or wget, etc.) to make those requests?


But how did you find the sites that had the robots.txt to begin with? An LLM must somehow discover that those pages exist and store that information before it can crawl them further or mark them as an acceptable source.
I am a human, so I can find other sites through links, word of mouth, business cards, or literally anywhere?

The LLM finds out about it from me when I ask it to go to the link.

You don’t accuse browsers of “somehow find[ing] the existence of those pages”. How does a browser know what page to visit?

The user tells it to.

If I prompt an LLM “go to example.net and summarize the page”, how is that any different from me typing example.net in a browser URL bar?

That is certainly true. But that is not how these tools work 99% of the time. This post originated from "search".
I think a distinction needs to be made between ingesting for LLM training and ingesting / crawling because a human asked it to during an inference session.

I have been talking about the latter; I agree the former is abusive.
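
For what it's worth, a site can express that training-versus-user-initiated split in robots.txt itself by giving different rules to different user agents. Here is a rough sketch using Python's standard urllib.robotparser. The robots.txt content and the agent names ExampleTrainingBot and ExampleUserFetcher are made up for illustration; real crawlers publish their own user-agent tokens, and nothing forces a client to honor these rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) training crawler entirely,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The training crawler is disallowed everywhere.
training_allowed = parser.can_fetch(
    "ExampleTrainingBot", "https://example.net/index.html"
)

# A (made-up) user-initiated fetcher falls under the "*" rules.
user_fetch_allowed = parser.can_fetch(
    "ExampleUserFetcher", "https://example.net/index.html"
)
private_allowed = parser.can_fetch(
    "ExampleUserFetcher", "https://example.net/private/notes.html"
)

print(training_allowed, user_fetch_allowed, private_allowed)
```

The check is purely advisory: it only tells a well-behaved client what the site asked for, which is also why anyone can read the Disallow lines by hand.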

Careful, some of those are honeypots or tripwires.