But how did you find the sites that had a robots.txt to begin with? An LLM must somehow discover that those pages exist and store that information before it can crawl them further or mark them as acceptable sources.
I think a distinction needs to be made between ingesting for LLM training and ingesting/crawling because a human asked for it during an inference session.
I have been talking about the latter; I agree the former is abusive.
Let's say you had a local model with the ability to do tool calls. You give that LLM the ability to use a browser. The LLM opens that browser, goes to Google or Bing, and does whatever searches it needs to do.
What is the difference if I use a browser or an LLM tool (or curl, or wget, etc.) to make those requests?
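To make that question concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleTrainingBot` name, and the `allowed` helper are all made up for illustration; they are not taken from any real site or crawler. As far as I understand it, robots.txt rules are matched only against the User-Agent name the client chooses to declare, so a browser, curl, and an LLM tool are told apart by that self-reported string and nothing else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the sample robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(SAMPLE_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # Each client is identified only by the string it sends.
    for agent in ("Mozilla/5.0", "curl/8.5.0", "ExampleTrainingBot/1.0"):
        print(agent, allowed(agent, url))
```

In this sketch the browser-style and curl-style agents get the same answer for the same URL, and only the agent that names itself as a training bot is refused. Whether a tool call made on a human's behalf should identify itself as a bot is a policy question the parser cannot settle.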