Hacker News
by kylebgorman 3937 days ago
Just my personal opinion, but when you have an indexed copy of the whole web, a few million OCRed-but-not-corrected books from previous centuries added to your LM are not going to improve 2015 speech recognition quality.
2 comments

It would illustrate how language and ideas evolve over time. It would illustrate how language and ideas from different geographical sources might differ or be similar, especially during pre-Internet periods. It would provide the source material that is referenced in contemporary works. It would provide many, many other benefits.
How many words do you think the entire web, as crawled by Google, has?
Way way more than a corpus of a few million published books, that's for sure. Hell, there are individual message boards that have higher word count than millions of books. Wikipedia arbitration cases (these aren't articles, but rather, an esoteric back channel for handling disputes between users) frequently reach novel-length.

The average quality is going to be lower, of course.

There are hundreds of thousands of words on Wikipedia about en dash, em dash, hyphen, and minus.

Here's one discussion of over ten thousand words: https://en.m.wikipedia.org/wiki/Wikipedia:Village_pump_(poli...

The least interesting thing about the Mexican American War is what type of dash you use between Mexican and American. There are over twenty thousand words about that dash on wiki meta.

15,000 words would be okay if at the end of it there was some kind of consensus, or something that could be transferred to different articles.

The future people are going to have a skewed image of us if they think meta wiki is representative.