Google surfaces data (or it used to). LLMs and AI companies actively exploit it, with zero benefit to the creators or users of the platforms they're now cannibalizing.
The irony. I'm surprised that businesses built on selling Google search results are allowed to exist. I guess it's for the same reason that Google scraping the internet and building a product on top of it is allowed.
Then it only makes sense that scraped AI training data is also going to be tolerated, because to prove infringement you would need to show, through forensic analysis, that a large language model like ChatGPT trained on your copyrighted content can produce a similar derivative of that content.
It's such an uphill battle for copyright holders. They need to replicate: copyrighted input ---> LM similar to ChatGPT4 ---> copyrighted output
So far it's not looking good for OpenAI, because it's possible to generate copyrighted output (type spiderman in Czech). All that remains is demonstrating the middle layer (training an LM similar to ChatGPT4 on the copyrighted input), but that is unrealistically expensive.
I have a theory that all this money spent on large models is meant to make discovery impossible (as it would require access to $100 billion worth of GPUs).
The whole notion that AI can replace search is nonsense. It yields no benefit to the creators of the results it scrapes and the models hallucinate. It's worse for users and it's worse for everyone producing anything of note online.
Google search is terrible. ChatGPT is definitely better for searching right now, and I often find myself reaching for it over Google for a wide category of questions.
Google search is terrible because Google's stopped caring about search quality in favor of monetization. It doesn't mean an LLM can outperform a traditional search engine that cares about said quality.