| HN Mirror

	by grepherder 5219 days ago
	The tip is of course valid and pro, and I'd recommend the same, but it's already being done, under machine translation. Also, in this area big data loses its meaning: you don't really need traditional databases, you just process raw text. There are literally thousands of people researching how to intelligently select and process this data.

gliese1337 5219 days ago

It's not just machine translation; it's image processing / cleanup (to handle huge amounts of data for multispectral imaging and figure out how to combine it into sets of false-color images that people can read), optical character recognition (for ancient handwriting in weird writing systems), system-level programming to run the scanners, etc. There's a big ol' book on this, "Rome Wasn't Digitized in a Day": http://www.clir.org/pubs/reports/pub150/pub150.pdf BYU (which I attend and work for) has done a huge amount of work in this field: http://maxwellinstitute.byu.edu/about/cpart.php

A few years ago I was writing web applications to support transcription of images of medieval documents in Old French, avoiding close-to-insurmountable OCR problems by using grad students, but that still requires segmenting images properly. The LDS church does similar stuff on a very large scale to digitize genealogical records. It makes research a whole lot easier, but there's still plenty of room for improvement; image maps don't always reliably match up with the fields that you're trying to read/transcribe on images of documents, and that's kind of a pain.

TheAmazingIdiot 5218 days ago

What we need here are true eyeballs to read the scripts.

I do medieval and renaissance dance reconstruction and dance performance. Having just been to an event, I took a class on the Dances of the Gresley Manuscript.

Well, what is this manuscript? It isn't a dance treatise, or anything of the sort. Gresley was a law student in the 1530s-1550s (we know this from later court cases by a lawyer named Gresley). These dance instructions come from the margins of his law book.

He wrote in musical notation, dance notation and other descriptive words. He even left words that have no meaning in the dance community. We have to deduce what he meant by a multitude of methods, none of which we can guarantee.

But back to the topic of OCR... How do these document scanners and OCR systems plan to deduce this kind of source written in the margins?

gliese1337 5218 days ago

    But back to the topic of OCR... How do these document scanners and OCR systems plan to deduce this kind of source written in the margins?

I have no idea; probably they don't, yet. Everything I've worked on uses students' eyeballs to do the actual character recognition, so I'm not deeply familiar with the state of the art. I do know that OCR is mainly used for documents that have a well-defined structure where you can make an image map identifying different semantic fields, and the contextual field information allows for much more intelligent OCR; it's not so good for big blocks of paragraph text.
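To make the image-map idea concrete, here is a minimal sketch of how contextual field information might constrain recognition. Everything in it is an assumption for illustration: the `IMAGE_MAP` layout, the field names, and the `crop` and `constrain` helpers are hypothetical, and the candidate scores stand in for whatever a real recognizer would return.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]

# Hypothetical image map for a form-like page:
# field name -> (bounding box, characters that field may contain).
IMAGE_MAP: Dict[str, Tuple[Box, str]] = {
    "surname": ((0, 0, 4, 2), "ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "birth_year": ((4, 0, 8, 2), "0123456789"),
}


def crop(page: List[List[int]], box: Box) -> List[List[int]]:
    """Cut one field out of a page stored as rows of pixel values."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def constrain(candidates: List[Tuple[str, float]], allowed: str) -> str:
    """Pick the best-scoring candidate character that the field allows.

    `candidates` is a list of (character, score) guesses for one glyph.
    Returns "?" when no guess is legal for this field.
    """
    legal = [(ch, score) for ch, score in candidates if ch in allowed]
    if not legal:
        return "?"
    return max(legal, key=lambda pair: pair[1])[0]


# A glyph that looks slightly more like the letter O than the digit 0.
guesses = [("O", 0.6), ("0", 0.4)]
year_char = constrain(guesses, IMAGE_MAP["birth_year"][1])
name_char = constrain(guesses, IMAGE_MAP["surname"][1])
```

With these inputs, `year_char` is `"0"` and `name_char` is `"O"`: the same ambiguous glyph resolves differently depending on which field it was cropped from. A big block of paragraph text offers no such per-field constraint.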

When you get to figuring out stuff scrawled in the margins, there are image pre-processing techniques that can identify regions of handwriting and then normalize it by rotation and scaling, but I'm pretty sure a complete solution is still in the realm of stuff considered AI (because, of course, once you know how to do it reliably, it becomes machine learning or pattern matching or something like that and no one calls it AI anymore).
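As a rough illustration of the "normalize it by rotation and scaling" step, here is a sketch that takes the ink-pixel coordinates of one handwriting region, estimates its dominant direction from the covariance of those coordinates (a swapped-in technique, principal-axis estimation, not something from any particular OCR package), rotates the region so that direction is horizontal, and rescales it to a fixed height. The function name `normalize_region` and the `target_height` parameter are hypothetical, and it assumes the region has already been found.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_region(points: List[Point], target_height: float = 1.0) -> List[Point]:
    """Rotate a set of ink-pixel coordinates so its principal axis is
    horizontal, then scale it so its vertical extent is `target_height`."""
    if not points:
        raise ValueError("region has no ink pixels")

    # Center the region on its centroid.
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]

    # Second moments of the centered coordinates.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Angle of the major axis of the coordinate covariance.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)

    # Rotate by -theta so the major axis lies along x.
    c, s = math.cos(-theta), math.sin(-theta)
    rotated = [(x * c - y * s, x * s + y * c) for x, y in centered]

    # Scale uniformly so the vertical extent matches the target.
    ys = [y for _, y in rotated]
    height = max(ys) - min(ys)
    scale = target_height / height if height > 0 else 1.0
    return [(x * scale, y * scale) for x, y in rotated]


# A synthetic 11x3 block of "ink", tilted by 30 degrees like a margin note.
angle = math.radians(30)
tilted = [
    (i * math.cos(angle) - j * math.sin(angle),
     i * math.sin(angle) + j * math.cos(angle))
    for i in range(11) for j in range(3)
]
flat = normalize_region(tilted)
```

This only handles the easy part. It assumes someone has already decided which pixels belong to the note, and it can leave the text upside down, since a principal axis has no preferred direction.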
