Hacker News
by ching_wow_ka 3937 days ago
I can say pretty certainly that all the text they've gathered through the Google Books project is in use in their language models and other AI models for their search engine, speech recognition, etc.

They got what they wanted. I can't see what incentive they have as a business to grant access to the books that justifies paying employees for it.

2 comments

Just my personal opinion, but when you have an indexed copy of the whole web, a few million OCRed-but-not-corrected books from previous centuries added to your LM are not going to improve 2015 speech recognition quality.
It would illustrate how language and ideas evolve over time. It would illustrate how language and ideas from different geographical sources might differ or be similar, especially during pre-Internet periods. It would provide the source material that is being referenced in contemporary works. It would provide many, many other benefits.
How many words do you think the entire web, as crawled by Google, has?
Way, way more than a corpus of a few million published books, that's for sure. Hell, there are individual message boards that have a higher word count than millions of books. Wikipedia arbitration cases (these aren't articles, but rather an esoteric back channel for handling disputes between users) frequently reach novel length.

The average quality is going to be lower, of course.

There are hundreds of thousands of words on Wikipedia about the en dash, em dash, hyphen, and minus sign.

Here's one discussion of over ten thousand words: https://en.m.wikipedia.org/wiki/Wikipedia:Village_pump_(poli...

The least interesting thing about the Mexican American War is what type of dash you use between Mexican and American. There are over twenty thousand words about that dash on wiki meta.

15,000 words would be okay if at the end of it there was some kind of consensus, or something that could be transferred to different articles.

People in the future are going to have a skewed image of us if they think meta wiki is representative.

Gather data on search queries and highlighted phrases for books? There is value in knowing which subset of the corpus is more valuable.

Apple acquired a "Pandora for Books" recommendation startup that had permission from publishers, who provided text for indexing (http://www.businessinsider.com/apple-buys-booklamp-2014-7). Its machine classification made it possible to search books for topics whose words were not present in the book.