Hacker News
by xtx23 4416 days ago
It's kind of interesting that it mentioned Gawker repackaging the archive without any mention of the new TimesMachine, http://timesmachine.nytimes.com, which does "a better job of resurfacing archival content."
2 comments

One thing I do have to give The New York Times credit for is that it's got an exceptionally good digital archive. All Web content ever posted is available online in full form.

Published articles at least through the early 20th century are indexed, typically with the lede paragraph or sentence. I'd love to have more, but that's a start.

If they had Google's OCR tech, it would be much better than it is. I wonder if Google ever thought about making a cloud OCR API product. It would align with their goals.
OCR isn't even necessary. There's also the Internet Archive's BookReader, which I noted recently:

https://openlibrary.org/dev/docs/bookreader

GNU Affero Licence, on GitHub:

http://github.com/openlibrary/bookreader

Google gives away its OCR stack in the form of free software.
Well, the goal is different. The TimesMachine is not specifically relevant to today's news. What they would like now is to easily provide insight on current subjects through old articles.

The format of the TimesMachine might not be the best. It would be better to isolate articles and put them in a modern format.

Yeah, makes sense. What you are describing sounds like this, "http://nyti.ms/1i1l7f9", or the retro reports, "http://nyti.ms/1lwoYPh", that they have been trying, I think.