

	by glenstein 583 days ago
	> Bill Gross correctly calls this phase of AI "shoplifting". I call it the Napster-of-Everything (because I am old). I am also betting that the courts won't buy the "fair use" interpretation of scraping, given the revenues AI companies generate. That means a potential stalling of new models until some mechanism is worked out to pay knowledge creators.

	To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

2 comments

ben_w 583 days ago

> To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

Still around, doing fine: https://en.wikipedia.org/wiki/Google_Books and https://books.google.com/intl/en/googlebooks/about/index.htm...

Given the timing, I suspect it was started as simple indexing, in keeping with the mission statement "Organize the world's information and make it universally accessible and useful".

There was also reCAPTCHA v1 (books) and v2 (street view), each of which improved OCR AI until state-of-the-art AI was able to defeat them in their role as CAPTCHA systems.


glenstein 583 days ago

I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and already was, aware of the wiki article and the role of reCAPTCHA.

Maybe I wasn't clear, but I was interested in the consequences of the legal stuff. It's not clear from the wiki article what any of this means with respect to the suitability of scans for AI training.


ben_w 583 days ago

> I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and already was, aware of the wiki article and the role of reCAPTCHA.

Timing as in: it started in 2004, when the most advanced AI most people used was a spam filter, so it wasn't seen as a training issue (in the way that LLMs are) *at the time*.

As for training rights, I agree with you: there's no clarity on how such data could be used *today* by the people who have it, especially as the arguments in favour of LLM training are often made by comparison to search engine indexing.


fragmede 583 days ago

Until such time as a lawsuit declares otherwise, Google's position is obviously that scanning books, OCRing them, saving that text in a database, and using that to allow searching is no different, legally, than scanning books, OCRing them, saving that text into a database, and using that to train LLMs. Book publishers already went up against Google over the practice of scanning in the first place; we'll see if they try again with LLM training.


pncnmnp 583 days ago

> I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

A few months ago, there was an interesting submission on HN about this - The Tragedy of Google Books (2017) (https://news.ycombinator.com/item?id=41917016).
