by fernly 434 days ago
A bit of context regarding Project Gutenberg. Its intake process is far from casual. Take a look at Project Gutenberg Distributed Proofreaders (PGDP, [0],[1]), one of the oldest "crowd-sourcing" projects on the net (est. 2000). As you can see from [0], every book goes through three rounds of proofing, where volunteers read each page of text and compare it to the scanned image; then through two rounds of format review, where other volunteers insert or review format markup.

From that 5-pass process the marked-up text is handed to a volunteer "post-processor" who assembles the final HTML or e-book file; then the completed book gets one more "smooth reading" pass before it is posted to PG.

This is the process that produces the books that are input to Standard Ebooks. That they can still find scanner errors ("tne" for "the", a typical "scanno") demonstrates how difficult those are to see. But their presence isn't from carelessness or disregard for the value of the books.

In the 20-teens I put in hundreds of volunteer hours at PGDP in all the above roles, and it was very satisfying work. I'd recommend it to anyone wanting an online hobby that feels constructive. Volunteering time to Standard Ebooks would probably feel good as well.

[0] https://www.pgdp.net/c/activity_hub.php

[1] https://en.wikipedia.org/wiki/Distributed_Proofreaders

6 comments

The work done by Distributed Proofreaders is pretty amazing. I try to contribute my 35 pages as often as I can. The backlog there is pretty insane, even while finishing upwards of 150 ebooks per month.

it truly is an "online hobby that feels constructive". you get these tiny glimpses into our shared literary/cultural history while knowing that the work you're doing is for the benefit of all (benefit of the public domain)

> The backlog there is pretty insane even while finishing upwards of 150 ebooks per month

Isn't the backlog there mostly in the post-processing step, though? To the point where they're taking finished texts and running them again through the page-by-page proofreading in hope of fishing out more OCR typos and improving the format markup?

You can also contribute at Wikisource if you prefer; it doesn't really have a post-processing step and has much less of a fixed pipeline. (There are explicit "proofreading" and "verification" steps per page, but not much beyond that.)

In a similar vein, there is Wikisource.[0] Wikisource has the advantage of allowing for extensive formatting to closely match the source works due to its wiki-based format, but doesn't have quite as robust processes. Its flexibility is unparalleled though -- it covers virtually any form of scanned print work and even some old movies, and contributors can focus on whatever niches they're interested in if they want.

[0] https://en.wikisource.org/wiki/Main_Page

> doesn't have quite as robust processes

They do have a double-pass system for all works based on scanned pages, which is quite nifty. Green means two passes complete: https://en.m.wikisource.org/wiki/Index:Sophocles%27_King_Oed...

Plus you can just jump in to any work, in true wiki fashion.

The amount of this that could be trivially automated fills me with rage.

Even just automated flagging of common errors would save 1000s of volunteer hours.
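
For what it's worth, a minimal sketch of what "automated flagging of common errors" could look like, assuming a hand-maintained table of known scannos. Only "tne" comes from this thread; the other table entries, the `COMMON_SCANNOS` name, and `flag_scannos` are hypothetical and are not part of any PGDP tool. It only flags words for a human to check and never rewrites the text.

```python
import re

# Hypothetical table of common scannos and the word OCR probably meant.
# "tne" -> "the" is from the thread; the other entries are illustrative guesses.
COMMON_SCANNOS = {
    "tne": "the",
    "tbe": "the",
    "arid": "and",      # a valid word, so it is only flagged for review
    "modem": "modern",  # likewise valid, but suspicious in an old book
}

WORD_RE = re.compile(r"[A-Za-z]+")


def flag_scannos(page_text):
    """Return (line_number, word, suggestion) tuples; nothing is auto-corrected."""
    flags = []
    for line_no, line in enumerate(page_text.splitlines(), start=1):
        for match in WORD_RE.finditer(line):
            word = match.group(0)
            suggestion = COMMON_SCANNOS.get(word.lower())
            if suggestion is not None:
                flags.append((line_no, word, suggestion))
    return flags


page = "It was tne best of times,\narid the worst of times."
for line_no, word, suggestion in flag_scannos(page):
    print(f"line {line_no}: {word!r} (did you mean {suggestion!r}?)")
```

A table like this catches only the errors someone has already thought of, which is why it is a flagging aid and not a replacement for a proofing round.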

It's unclear that that would save time. If you put enough hours into the project, you can get classified as one of those later-pass proofers. That is extremely taxing work because most of the scannos have already been found by the earlier proofers. You will "complete" multiple pages without ever finding a scanno. Doubt starts to set in about whether you are on auto-pilot or not.

Meanwhile, in that early stage, because of the stream of errors, it is easy to pay attention and feel like you are doing rewarding work. Moreover, if you are quite quick and diligent, you can basically just read a book as volunteer work.

Also, sometimes the error is in the source material. Different editors have different opinions about what should be done there. Sometimes I had to re-add mistakes that were "fixed" by early proofers trying to correct grammar, if I recall correctly... it was a while back that I volunteered.

> In the 20-teens

That being 2013 to 2019?

I think a lot of people (my past self included) underestimate how much meticulous, behind-the-scenes work goes into something like PGDP.
out of curiosity, wouldn't an automated spell check pass help catch ocr errors? e.g. "tne" would be caught immediately.
The most confusing errors are the ones spellcheck doesn't catch because they turn a word into another valid word. And those are the errors we want the least.
true, it wouldn't do a 100% job, but it would be another line of defense. the reason I was wondering about it was that the gp cited an example that was easy for humans to miss, but would be caught at once with a spell checker.

there are also statistical methods to detect words that are changed into other, valid words - check out the grammar checker in google docs for instance. again, not 100%, but every bit helps.
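
As a rough illustration of the statistical idea (not how Google Docs does it, which I don't know), here is a toy sketch that uses bigram counts from a reference text to flag a valid word when a look-alike fits the neighbouring words better. The confusion sets, function names, and tiny corpus are all made up for the example, and tokenising on whitespace is a deliberate simplification.

```python
from collections import Counter

# Hypothetical confusion sets: valid words that OCR tends to swap for each other.
CONFUSION_SETS = [{"he", "be"}, {"arid", "and"}, {"modem", "modern"}]


def bigram_counts(corpus_text):
    """Count adjacent word pairs in a reference text (whitespace tokens only)."""
    words = corpus_text.lower().split()
    return Counter(zip(words, words[1:]))


def context_score(words, i, candidate, counts):
    """How often the candidate was seen next to the actual left and right neighbours."""
    left = counts[(words[i - 1], candidate)] if i > 0 else 0
    right = counts[(candidate, words[i + 1])] if i + 1 < len(words) else 0
    return left + right


def flag_real_word_errors(text, counts):
    """Return (index, word, better_fit) where a look-alike word fits the context better."""
    words = text.lower().split()
    flags = []
    for i, word in enumerate(words):
        for group in CONFUSION_SETS:
            if word not in group:
                continue
            best = max(sorted(group), key=lambda c: context_score(words, i, c, counts))
            if best != word and (
                context_score(words, i, best, counts)
                > context_score(words, i, word, counts)
            ):
                flags.append((i, word, best))
    return flags


counts = bigram_counts("bread and butter and salt and pepper")
print(flag_real_word_errors("salt arid pepper", counts))  # [(1, 'arid', 'and')]
print(flag_real_word_errors("salt and pepper", counts))   # []
```

With a corpus this small the counts are meaningless; the point is only that the check looks at neighbouring words, so it can catch "arid" for "and" where a dictionary lookup cannot.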

It would probably also throw out a lot of false positives which would take time to check. Especially in works of fiction, writers could take liberties with non-standard spelling.
Unless "tne" is an abbreviation, in which case it should pass. Names are a common place where people make up weird spellings, which makes spell checkers annoying. I have terrible spelling, and yet most of the time I run spellcheck it trips on words that are spelled correctly but not in the dictionary (in large part because I run spellcheck after each revision, so the genuinely misspelled words are already fixed). "Add to dictionary" means that my dictionary gets polluted with words that only apply to one document and would be wrong in the next.
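
One way around the polluted-dictionary problem is to keep a separate word list per document and leave the shared dictionary alone. A minimal sketch, assuming a plain set of known words; `unknown_words`, the tiny dictionary, and the invented name "Zorvath" are all hypothetical.

```python
import re

WORD_RE = re.compile(r"[A-Za-z']+")


def unknown_words(text, base_dictionary, document_words=frozenset()):
    """List words found in neither the shared dictionary nor this document's own list."""
    known = {w.lower() for w in base_dictionary} | {w.lower() for w in document_words}
    unknown = []
    for match in WORD_RE.finditer(text):
        word = match.group(0).strip("'")
        if word and word.lower() not in known and word not in unknown:
            unknown.append(word)
    return unknown


base = {"the", "wizard", "met", "at", "inn"}
text = "Zorvath met the wizard at tne inn."

print(unknown_words(text, base))                              # ['Zorvath', 'tne']
print(unknown_words(text, base, document_words={"Zorvath"}))  # ['tne']
```

The made-up name is accepted for this one book only, so it cannot mask a typo in the next document, while "tne" is still reported.
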
An LLM-based spellchecker would've caught it for sure. I am working on one here: https://github.com/pulkitsharma07/spelltastic.io. If anyone has suggestions on how this can help in Project Gutenberg's or Standard Ebooks' workflows, please reach out to me or open an issue.

I have seen that LLMs are pretty good at understanding context/domain / theme-specific terms, so their spellchecking is pretty good.

For future reference this approach was tested at https://github.com/standardebooks/tools/issues/815. No errors were found in a selection of books.
Running spellcheck is a standard step on every page of proofreading. There's a "wordcheck" button in the proofing UI.
the distributed proofreaders process does include a mandatory spellcheck