by Normati 4160 days ago

"The text files were created by manually keying the full text of each work, based on millions of digital facsimile page images"

!!! This is not Silicon Valley. I wonder how they ensure accuracy.

Link to the books: http://ota.ox.ac.uk/tcp/

3 comments

rmc 4160 days ago

Hundreds of grad students.

th0br0 4160 days ago

Well, maybe they used Amazon Mechanical Turk ;)

jbaiter 4160 days ago

I know of a similar German project (http://www.deutschestextarchiv.de): they have two independent non-German speakers transcribe the digital facsimiles, to ensure that the transcriptions are as accurate as possible.
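
A minimal sketch of how two independent keyings could be checked against each other, assuming each transcription is available as a plain string. The function name `keying_disagreements` is hypothetical and this is not the project's actual tooling; it only illustrates the idea that wherever two typists disagree, a reviewer goes back to the facsimile.

```python
import difflib


def keying_disagreements(first: str, second: str) -> list[tuple[str, str]]:
    """Return (first, second) word runs where two transcriptions differ."""
    a, b = first.split(), second.split()
    # Compare word by word; autojunk is off so repeated words are not ignored.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means that typist has nothing there.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    typist_a = "TO THE RIGHT VVORSHIPFVLL MAISTER"
    typist_b = "TO THE RIGHT WORSHIPFVLL MAISTER"
    for left, right in keying_disagreements(typist_a, typist_b):
        print(f"check facsimile: {left!r} vs {right!r}")
```

Two typists who cannot read the language are unlikely to make the same mistake in the same place, so agreement between them is reasonable evidence that a word was keyed as printed.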

arocks 4160 days ago

I highly doubt that. One of the texts [1] starts with the line:

> TO THE RIGHT VVORSHIPFVLL MAISTER RObert Clarke,

The mistakes look like typical OCR errors.

[1]: http://tei.it.ox.ac.uk/tcp/Texts-HTML/free/A01/A01716.html

philers 4160 days ago

In fact, those mistakes look more like accurate transcriptions of Early Modern manuscripts - with their looser spelling rules and often idiosyncratic use of letters.

It's kind of interesting that they look like the same errors as those generated by OCR.

The difficulty of deciphering the text makes this huge task even more impressive!

coroxout 4160 days ago

It's precisely those idiosyncrasies of early modern orthography which make it difficult to use an off-the-shelf OCR package, which is presumably why these are hand-transcribed instead.

Perhaps there is a specialist antiquarian OCR package which can deal with long s, interchangeable u and v, non-standardised spelling, etc, but I have yet to come across one.
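
To make those idiosyncrasies concrete, here is a rough sketch of folding spellings into a comparison key, so that a faithful transcription and a modernised spelling can be matched. The function name `fold_spelling` and the specific rules (long s to s, doubled v to w, v to u, j to i) are my own assumptions for illustration, not anything the TCP or an OCR package is known to do, and the doubled-v rule would mangle a modern word such as "savvy".

```python
def fold_spelling(word: str) -> str:
    """Fold early modern letter variants into one key for matching.

    The result is a comparison key, not a modernised spelling.
    """
    key = word.lower()
    key = key.replace("\u017f", "s")  # long s (ſ) -> s
    key = key.replace("vv", "w")      # doubled v printed in place of w
    key = key.replace("v", "u")       # u and v treated as one letter
    key = key.replace("j", "i")       # i and j treated as one letter
    return key


if __name__ == "__main__":
    for old, new in [("VVORSHIPFVLL", "worshipfull"),
                     ("Mai\u017fter", "maister"),
                     ("loue", "love")]:
        print(old, new, fold_spelling(old) == fold_spelling(new))
```

Folding both sides to a shared key avoids guessing which of u or v a modern reader would expect, which is the part that is hard to get right with simple rules.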

acdha 4160 days ago

Have you looked at the Early Modern OCR Project? My understanding is that they're working on exactly that, as well as on better tools for reviewing and retraining at large scale:

http://emop.tamu.edu/

coroxout 4159 days ago

No, I hadn't, and am grateful for the link - thank you!
