	by xtx23 4416 days ago
	It is kinda interesting that it mentioned Gawker repackaging its archive without any mention of the new TimesMachine, http://timesmachine.nytimes.com, which is "a better job of resurfacing archival content."

2 comments

dredmorbius 4416 days ago

One thing I do have to give The New York Times credit for is that it's got an exceptionally good digital archive. All Web content ever posted is available online in full form.

Published articles at least through the early 20th century are indexed, typically with the lede paragraph or sentence. I'd love to have more, but that's a start.

xtx23 4416 days ago

If they had Google's OCR tech, it would be much better than it is. I wonder if Google ever thought about making a cloud OCR API product. It would align with their goals.

dredmorbius 4415 days ago

OCR isn't even necessary. There's also The Internet Archive's BookReader which I noted recently:

https://openlibrary.org/dev/docs/bookreader

GNU Affero Licence, on GitHub:

http://github.com/openlibrary/bookreader

thrownaway2424 4415 days ago

Google gives away its OCR stack in the form of free software.

louhike 4416 days ago

Well, the goal is different. The TimesMachine is not specifically relevant to today's news. What they would like now is to easily provide insight into current subjects through old articles.

The format of the TimesMachine might not be the best. It would be better to isolate articles and put them in a modern format.

xtx23 4416 days ago

Yeah, makes sense. What you are saying sounds like this, "http://nyti.ms/1i1l7f9", or the retro reports, "http://nyti.ms/1lwoYPh", that they have been trying, I think.
