by stanfordkid
159 days ago
I don't see how you stop LLMs from scraping data without also stopping humans from retrieving valid data. If you are the NYTimes and publish poisoned data to scrapers, the only thing the scraper needs is one valid human subscription: they run a VM with automated Chrome, OCR and tokenize the valid data, then compare that to the scraped results. It's pretty much trivial to do. At Anthropic/Google/OpenAI scale they can easily buy VMs in data centers spread all over the world with IP shuffling. There is no way to tell who is accessing the data.
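To make the comparison step concrete, here is a rough sketch in Python. It assumes the subscriber copy and the scraped copy are already plain-text strings (the browser automation and OCR are not shown), and the function names `tokenize`, `differing_spans`, and `similarity` are made up for illustration. The word-level regex is a crude stand-in for whatever tokenizer a real pipeline would use.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    # Lowercased word tokens; a crude stand-in for a real tokenizer.
    return re.findall(r"\w+", text.lower())


def differing_spans(reference: str, scraped: str) -> list[tuple[str, str]]:
    # Return (reference_text, scraped_text) pairs wherever the two copies diverge.
    ref, got = tokenize(reference), tokenize(scraped)
    matcher = SequenceMatcher(None, ref, got, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(ref[i1:i2]), " ".join(got[j1:j2])))
    return spans


def similarity(reference: str, scraped: str) -> float:
    # Ratio in [0, 1]; 1.0 means the token sequences are identical.
    ref, got = tokenize(reference), tokenize(scraped)
    return SequenceMatcher(None, ref, got, autojunk=False).ratio()


if __name__ == "__main__":
    trusted = "The council approved the budget on Tuesday"
    served_to_bot = "The council rejected the budget on Tuesday"
    print(differing_spans(trusted, served_to_bot))
    print(similarity(trusted, served_to_bot))
```

Any article whose similarity falls below some chosen threshold would get flagged and replaced with the trusted copy. The threshold and what counts as a meaningful difference (OCR noise vs. deliberate poisoning) are left open here.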