by grepherder
5172 days ago
The tip is of course valid and professional, and I'd recommend the same, but it's already being done, under the name of machine translation. Also, in this area "big data" loses its meaning, since you don't really need traditional databases; you just process raw text. There are literally thousands of people researching how to intelligently select and process this data.
|
A few years ago I was writing web applications to support transcription of images of medieval documents in Old French. We avoided close-to-insurmountable OCR problems by using grad students, but that still requires segmenting the images properly. The LDS church does similar work on a very large scale to digitize genealogical records. It makes research a whole lot easier, but there's still plenty of room for improvement: image maps don't always reliably match up with the fields you're trying to read or transcribe on the document images, and that's kind of a pain.