| HN Mirror


by ben_w 546 days ago

> To your point, I have wondered whatever became of that massive initiative from Google to scan books, and whether that might be looked at as a potential training source, given that Google has run into legal limitations on other forms of usage.

Still around, doing fine: https://en.wikipedia.org/wiki/Google_Books and https://books.google.com/intl/en/googlebooks/about/index.htm...

Given the timing, I suspect it was started as simple indexing, in keeping with the mission statement "Organize the world's information and make it universally accessible and useful".

There was also reCAPTCHA v1 (books) and v2 (street view), each of which improved OCR AI until state-of-the-art AI was able to defeat them as CAPTCHA systems.


glenstein 546 days ago

I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and already was, aware of the wiki article and the role of reCAPTCHA.

Maybe I wasn't clear, but I was interested in the consequences of the legal stuff. It's not clear from the wiki article what any of this means with respect to the suitability of scans for AI training.


ben_w 546 days ago

> I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and already was, aware of the wiki article and the role of reCAPTCHA.

Timing as in: it started in 2004, when the most advanced AI most people used was a spam filter, so it wasn't seen as a training issue (in the way that LLMs are) *at the time*.

As for training rights, I agree with you: there's no clarity on how such data could be used *today* by the people who have it, especially as the arguments in favour of LLM training are often made by comparison to search engine indexing.


fragmede 546 days ago

Until such time as a lawsuit declares otherwise, Google's position is obviously that scanning books, OCRing them, saving that text into a database, and using that to allow searching is no different, legally, than scanning books, OCRing them, saving that text into a database, and using that to train LLMs. Book publishers already went up against Google over the practice of scanning in the first place; we'll see if they try again with LLM training.
