| HN Mirror



	by causality0 1467 days ago
	> PG does great work and we rely on them almost exclusively for transcriptions

	Until I got to this part of the comment I was thinking "Yay, an alternative to PG's godawful OCR transcriptions". Why would you reuse the worst part of Project Gutenberg?


jxramos 1467 days ago

It's a starting point is what I think they're getting at. Preclassification which a human then corrects--we're effectively talking about a labor saving device for an otherwise tedious task.


AdmiralAsshat 1467 days ago

Can confirm. Think of it like using an AI to do an initial pass at a conference transcription and then correcting the typos, rather than doing the whole transcription by hand. Even if it's only 85% accurate, you've still saved a boatload of time.

When I did "The Valley of Fear" as my first project, the PG text was used as the base, but if I encountered any kind of ambiguity in the text, I consulted at least a half-dozen other versions of the text via Google Books scans for agreement.
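A minimal sketch of that kind of cross-edition check, assuming the ambiguous passage has already been transcribed from each scan. The function name `consensus_reading` and the strict-majority rule are illustrative assumptions, not the project's actual tooling:

```python
from collections import Counter


def consensus_reading(base, witnesses):
    """Pick a reading for an ambiguous word or passage.

    base      -- the reading in the base text (here, the PG transcription)
    witnesses -- readings of the same spot taken from other editions' scans

    Returns (reading, needs_review). A strict majority of witnesses wins;
    otherwise the base reading is kept and flagged for a human to check.
    """
    if not witnesses:
        return base, True
    reading, votes = Counter(witnesses).most_common(1)[0]
    if votes * 2 > len(witnesses):
        return reading, False
    return base, True


# Hypothetical example: five of six scans agree the OCR misread "the".
print(consensus_reading("tbe", ["the", "the", "the", "tbe", "the", "the"]))
```

A split vote deliberately falls back to manual review rather than guessing, since the editions may differ because of later editorial changes.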

The team is also very particular about only using editions that have entered into the public domain. So if the first edition of a book just entered public domain, you must make sure that what you have produced only uses text from the first edition, and that you haven't inadvertently used a later edition as a base that may have included subsequent editorial changes.


causality0 1467 days ago

So they're actually reading the texts and correcting the mistakes?


acabal 1467 days ago

Yes - that's one of the main points of the project!


jxramos 1460 days ago

I'm curious what tooling folks use to accelerate this process. Has anyone written custom GUI stuff like a Tesseract box editor?


baobabKoodaa 1467 days ago

Hmm, I'm fairly confident a large chunk of this work (correcting OCR errors) could be automated. I would be happy to take a shot at this problem as a volunteer, if you're open to the idea?


hombre_fatal 1466 days ago

It’s not, because primary scans have arbitrary quality. Better OCR tech will spare you corrections, but it won't spare you from comparing the text against the scan, which is the big fixed cost whether you're correcting 1000 errors or 10.
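To put rough numbers on that argument, here is a toy cost model. Every figure in it (page count, minutes per page, minutes per fix) is a made-up assumption for illustration, not a measurement:

```python
def proofreading_minutes(pages, errors, compare_per_page=2.0, fix_per_error=0.25):
    """Toy model: a fixed cost to compare every page against its scan,
    plus a variable cost for each error that has to be corrected.
    The default rates are hypothetical."""
    return pages * compare_per_page + errors * fix_per_error


# Same hypothetical 300-page book, with worse and better OCR.
rough_ocr = proofreading_minutes(pages=300, errors=1000)
clean_ocr = proofreading_minutes(pages=300, errors=10)
print(rough_ocr, clean_ocr)
```

Under these assumed rates the comparison pass dominates either way, so cutting errors a hundredfold does not cut the total time by anything like that much.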
