| HN Mirror



	by nicce 452 days ago
	It must build the search index somehow, and that happens before any human action. If it respected robots.txt, it simply would not find the page at all.

2 comments

pests 452 days ago

I remember, as a teen in the late 90s/early 2000s, going to robots.txt specifically to see what they were trying to hide and exploring those URLs.

What is the difference if I use a browser or an LLM tool (or curl, or wget, etc.) to make those requests?


nicce 452 days ago

But how did you find those sites that had the robots.txt to begin with? An LLM must somehow discover that those pages exist and store that information before it can crawl them further or mark them as an acceptable source.


pests 452 days ago

I am a human so I can visit other sites with links or from word of mouth or business cards or literally anywhere?

The LLM finds out about it from me, when I ask it to go to the link.

You don’t accuse browsers of “somehow find[ing] the existence of those pages”. How does a browser know what page to visit?

The user tells it to.

If I prompt an LLM “go to example.net and summarize the page” how is that any different from me typing example.net in a browser URL bar?


nicce 451 days ago

That is certainly true. But that is not how these tools work 99% of the time. This post originated from "search".


pests 449 days ago

I think a distinction needs to be made between ingesting for LLM training and ingesting / crawling because a human asked it to during an inference session.

I have been talking about the latter; I agree the former is abusive.
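
To make that distinction concrete, here is a minimal sketch of how a site could express it in robots.txt and how a well-behaved client could check it with Python's standard `urllib.robotparser`. The agent names `ExampleTrainingBot` and `ExampleUserFetcher` and the rules themselves are hypothetical, made up for illustration; they are not real crawler names. Note that robots.txt is advisory: the check only happens if the client chooses to run it, which is why a browser or curl can fetch a disallowed URL anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block a bulk training crawler entirely,
# but only keep everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

for agent in ("ExampleTrainingBot", "ExampleUserFetcher"):
    for url in ("https://example.net/page", "https://example.net/private/x"):
        print(agent, url, is_allowed(parser, agent, url))
```

Under these assumed rules the training crawler is refused everywhere, while the user-triggered fetcher falls through to the `*` group and is only refused under `/private/`.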


kevindamm 452 days ago

careful, some of those are honey pots or trip wires


Tostino 452 days ago

Let's say you had a local model with the ability to do tool calls. You give that LLM the ability to use a browser. The LLM opens that browser, goes to Google or Bing, and does whatever searches it needs to do.

Why would that be an issue?
