Hacker News
by axg11 1238 days ago
A key detail a lot of people are missing about "traditional" search vs. ChatGPT style search:

ChatGPT/LLMs can essentially crawl _anything_ they want, regardless of legality, license, consent, etc. These models are trained on anything that can be ingested. Once trained, you can release the model with plausible deniability. There's no 1:1 relation between ingested content and outputs. LLMs that "cheat" by ingesting content they shouldn't have will have an advantage over those that don't.

Google and other search engines don't have this luxury. If they serve a result, they have to make sure that they're not violating any license. If they crawl the wrong content, they have to make sure they don't serve it.

3 comments

Google and other search engines also crawl anything they want. If it is accessible on the internet, it is fair game. There have been countless disputes about linking to copyrighted content, posting blurbs, etc., and Google has won most of them with the fair use argument.

If Google removed copyrighted content from its index, the results would be Wikipedia and...not much else.

This isn't true. Search engines rely on fair use: https://www.everycrsreport.com/reports/RL33810.html
Agreed. Sites like GitHub actively monitor for unwanted crawling/scraping and throttle requests.
> ChatGPT/LLMs can essentially crawl _anything_ they want, regardless of legality, license, consent, etc. These models are trained on anything that can be ingested. Once trained, you can release the model with plausible deniability.

We will see. The idea that ML models contain the mere creative essence and are generative from something that cannot be copyrighted is not one that has been tested in court.

I personally am not convinced: my own experiments with prompt-stuffing GPT do seem to reveal parts of the training corpus.

I am reminded of a story of how billg would type a command into basic computers at trade shows to "reveal" that they contained Microsoft-copyrighted code (gotcha!).

I imagine if someone did that in front of a judge, it would be game over.