This perspective is more cut and dried when it's someone like OpenAI scraping the whole internet explicitly for LLM training purposes. But Google has already been scraping the entire internet for 25+ years. At what point did building a smarter search engine transition from indexing to 'robbing'? And it's not like training Gemini is the first time they used their internet cache to build AI. AI, as academics use the term, has been in use on Google results for a long time.
Basically, if we were okay with Google scraping the internet to build a search index, what is the line they crossed that turned this from acceptable search engine indexing into theft?
It is more like imitating the imitators. There is not much of a legal case here, but poisoning the data is fair game both for those producing original data and for those producing its regurgitations.
I think it's very hard for the 'websites' to poison the data for AI, though. We don't have a 'single point of ingestion' where we could measure when it's being pumped for training data.