Hacker News
by doctor_blood 52 days ago
Small world - I'm currently cleaning up scans of the EB 9th edition to put it online as a MediaWiki site; I'm including all the illustrations and plates so I'm only a third of the way through.

I've been testing different OCR tools and so far I've been the most impressed with PaddleOCR - it correctly split the text columns, labeled the illustrations, and noted the margin text.

Still, it's not perfect, so I'm having to hand-edit some tables. I plan to put the source pages online as well so you can switch between the scanned page and the electronic text.
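
For the hand-edited tables, here is a minimal sketch of how corrected rows could be turned into MediaWiki table markup. It assumes the rows are already available as plain Python lists; the function name `rows_to_wikitable` and the sample values are made up for illustration and are not taken from the encyclopedia or from any OCR tool's output.

```python
def rows_to_wikitable(header, rows, caption=None):
    """Render a header and data rows as MediaWiki table markup."""

    def cell(value):
        # Encode literal pipes so cell text is not read as table syntax.
        return str(value).strip().replace("|", "&#124;")

    lines = ['{| class="wikitable"']
    if caption:
        lines.append("|+ " + cell(caption))
    # Header cells share one line, separated by "!!".
    lines.append("! " + " !! ".join(cell(h) for h in header))
    for row in rows:
        lines.append("|-")  # starts a new table row
        lines.append("| " + " || ".join(cell(c) for c in row))
    lines.append("|}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data, purely hypothetical.
    sample_header = ["Column A", "Column B"]
    sample_rows = [["a1", "b1"], ["a2", "b|2"]]
    print(rows_to_wikitable(sample_header, sample_rows))
```

Keeping the corrected rows as data and generating the markup from them means a fix to one cell does not require touching the wikitext by hand.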

3 comments

For those unfamiliar, the 1875 9th ed. was known as the scholar's edition due to how many eminent persons had contributed; it's a fascinating snapshot of the late 1800s.

Other material that would be fun to put online in a hyperlinked and indexed format includes geographic and medical atlases and the Baedeker travel guides.

I'm looking forward to it. The 9th is great in its own right and a lot of it is in the 11th. Alfred Newton's nearly 200 articles on bird species and a few classic essays by Macaulay come to mind offhand.
re: OCR of tables, would the work done on https://github.com/tabulapdf/tabula / https://tabula.technology/ be relevant?