Hacker News
by nicce 452 days ago
It must form the search index somehow. That happens prior to any human action. It simply would not find the page at all if it respected robots.txt.
2 comments

I remember in the late 90s/early 2000s, as a teen, going to robots.txt specifically to see what they were trying to hide and exploring those URLs.

What is the difference if I use a browser or an LLM tool (or curl, or wget, etc.) to make those requests?

But how did you find those sites that had the robots.txt to begin with? An LLM must somehow discover that those pages exist and store that information before it can crawl them further or mark them as an acceptable source.
I am a human so I can visit other sites with links or from word of mouth or business cards or literally anywhere?

The LLM finds out about it from me, when I ask it to go to the link.

You don’t accuse browsers of “somehow find[ing] the existence of those pages”. How does a browser know what page to visit?

The user tells it to.

If I prompt an LLM “go to example.net and summarize the page” how is that any different from me typing example.net in a browser URL bar?

That is certainly true. But that is not how these work 99% of the time. This post originated from "search".
I think a distinction needs to be made between ingesting for LLM training and ingesting / crawling because a human asked it to during an inference session.

I have been talking about the latter, agree the former is abusive.
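
For anyone who wants to see what "respecting robots.txt" looks like mechanically, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleTrainingBot` and `UserDirectedAgent` names, and the `allowed` helper are all made up for illustration; nothing here reflects how any particular crawler or LLM tool actually behaves. Note that the check is advisory: the client decides whether to run it and which user agent it claims.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A bulk crawler that identifies itself honestly is blocked everywhere.
    print(allowed("ExampleTrainingBot", "https://example.net/index.html"))
    # Any other agent falls under the "*" rules: only /private/ is off limits.
    print(allowed("UserDirectedAgent", "https://example.net/index.html"))
    print(allowed("UserDirectedAgent", "https://example.net/private/page"))
```

Whether a fetch made because a user typed or pasted a URL should run this check at all is exactly the policy question being argued here; the code only shows what the check is.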

careful, some of those are honey pots or trip wires
Let's say you had a local model with the ability to do tool calls. You give that LLM the ability to use a browser. The LLM opens that browser, goes to Google or Bing, and does whatever searches it needs to do.

Why would that be an issue?