Hacker News
by nicce 460 days ago
But how did you find those sites that had the robots.txt to begin with? The LLM must somehow discover that those pages exist and store that information before it can crawl them further or mark them as acceptable sources.

I am a human so I can visit other sites with links or from word of mouth or business cards or literally anywhere?

The LLM finds out about it from me, when I ask it to go to the link.

You don’t accuse browsers of “somehow find[ing] the existence of those pages”. How does a browser know what page to visit?

The user tells it to.

If I prompt an LLM “go to example.net and summarize the page” how is that any different from me typing example.net in a browser URL bar?

That is certainly true, but that is not how these work 99% of the time. This post originated from "search".
I think a distinction needs to be made between ingesting for LLM training and ingesting / crawling because a human asked it to during an inference session.

I have been talking about the latter; I agree the former is abusive.
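
For what it's worth, robots.txt can already express roughly this distinction, as long as the two kinds of fetcher identify themselves with different user agents. Here is a minimal sketch using Python's standard urllib.robotparser. The agent names ExampleTrainingBot and ExampleUserFetcher and the rules themselves are made up for illustration, not taken from any real crawler or site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block a (made-up) training crawler,
# allow every other user agent.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Bulk crawl for training, identified by its own agent name.
training_allowed = may_fetch("ExampleTrainingBot", "https://example.net/page")

# Fetch triggered by a user's prompt, identified by a different agent name.
user_fetch_allowed = may_fetch("ExampleUserFetcher", "https://example.net/page")

print(training_allowed, user_fetch_allowed)
```

This only helps if the fetcher honestly reports which mode it is in, and robots.txt is advisory, so it relies on the client choosing to check it.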