Transcribing the Phyllis Diller Gag File (transcription.si.edu)
16 points by reedk 3399 days ago
4 comments

It's basically all typewriter font and expertly scanned. What exactly is the issue with using even basic OCR?
The Smithsonian is transcribing other, much more difficult works, such as the cursive lab notebook of a historic astrophysicist[0]. I am ridiculously jealous that they are getting this sort of crowd-sourced help to clean data.

Compare to hampanda.com (from Deepgram, YC W16)

[0] https://transcription.si.edu/transcribe/8634/ECOFD

I was wondering this too. There are typographic errors on the cards that are being transcribed verbatim. One example:

"What is the different between a blond and a bruck[sic]?"

"After you lay a brick it does'nt[sic] follow you around for a week."

From the perspective of the artifact, I can imagine that keeping the typos would be reasonable, but from the perspective of searchability it doesn't make a lot of sense to me.

Interesting project, but there are so many design problems with the approach to involving users.

Use OCR first, then use humans to verify.

Next, present a task right up front that anyone can help with -- draw people right in. Don't make users "look for work", and minimize or eliminate the need for training.

For example: "If there's a date shown, enter it here ______" (with an option for "no date").

Or, "Correct this text as it appears on the card: ________"

Or, "Is there an attribution/credit mentioned? If so, enter it here ____________________"

etc.
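
A rough sketch of what that flow could look like in Python, assuming OCR has already run and reports some per-card confidence score. `OcrCard`, `MicroTask`, `build_queue`, and `next_task` are made-up names for illustration, not anything from the Smithsonian site, and ordering by lowest confidence first is just my assumption about where human eyes help most:

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Optional


@dataclass
class OcrCard:
    card_id: str
    text: str          # raw OCR output for the card (assumed to exist already)
    confidence: float  # hypothetical 0..1 score reported by the OCR engine


@dataclass
class MicroTask:
    card_id: str
    prompt: str
    prefill: Optional[str] = None  # OCR guess shown to the volunteer, if any


def build_queue(cards: Iterable[OcrCard]) -> Deque[MicroTask]:
    """Turn OCR'd cards into small tasks that need no training."""
    queue: Deque[MicroTask] = deque()
    # Assumption: least-confident OCR results are verified first.
    for card in sorted(cards, key=lambda c: c.confidence):
        queue.append(MicroTask(
            card.card_id,
            "Correct this text as it appears on the card:",
            prefill=card.text,
        ))
        queue.append(MicroTask(
            card.card_id,
            'If there\'s a date shown, enter it here (or "no date"):',
        ))
        queue.append(MicroTask(
            card.card_id,
            "Is there an attribution/credit mentioned? If so, enter it here:",
        ))
    return queue


def next_task(queue: Deque[MicroTask]) -> Optional[MicroTask]:
    """Hand out the next task immediately so nobody has to look for work."""
    return queue.popleft() if queue else None
```

The point is that the landing page would call something like `next_task` and show a single prompt with the OCR guess already filled in, rather than sending volunteers off to browse projects.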

If you're ever in Washington, D.C., you can view Bob Hope's joke file at the Library of Congress, where there's a special exhibit on him.

Hope's career started in vaudeville, then moved to radio, the movies, and finally TV. Interestingly, he did several movies with Phyllis Diller, and she was on a lot of his TV specials.

https://www.loc.gov/exhibits/bobhope/jokes.html

It is a nice peek at the hard work behind the scenes that goes into being a successful comedian.