by MaDeuce 3933 days ago

Here are a couple of ideas. Neither does exactly what you want, but they may point you in a useful direction.

PDFMiner[1] is a Python toolkit for working with PDF files. Among other things, it extracts text from them. It also includes a tool for finding objects and their coordinates in a PDF file. I have not tried that part myself, but it may get you your words and their locations.
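
To make the words-and-locations idea concrete, here is a rough sketch. It assumes pdfminer.six (the maintained fork of PDFMiner) and its layout API, which may differ from the PDFMiner release available when this was written. The function names `group_chars_into_words` and `words_with_boxes` are made up for illustration; only the `pdfminer` imports come from the library.

```python
def group_chars_into_words(chars):
    """Group (text, bbox) pairs into words.

    `chars` is an iterable of (text, bbox), where bbox is (x0, y0, x1, y1)
    or None for whitespace that the layout analysis inserted.
    Returns a list of (word, bbox) with the union box of the word's letters.
    """
    words, letters, box = [], [], None

    def flush():
        nonlocal letters, box
        if letters:
            words.append(("".join(letters), box))
        letters, box = [], None

    for text, bbox in chars:
        if bbox is None or text.isspace():
            flush()  # whitespace ends the current word
            continue
        letters.append(text)
        if box is None:
            box = tuple(bbox)
        else:
            # Grow the word's box to cover this character as well.
            box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                   max(box[2], bbox[2]), max(box[3], bbox[3]))
    flush()
    return words


def words_with_boxes(pdf_path):
    """Yield (page_number, word, bbox) for every word in the PDF."""
    # Imported here so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # LTChar carries a bbox; LTAnno (inserted spaces/newlines) does not.
                chars = [
                    (obj.get_text(), obj.bbox if isinstance(obj, LTChar) else None)
                    for obj in line
                ]
                for word, bbox in group_chars_into_words(chars):
                    yield page_number, word, bbox
```

Calling `list(words_with_boxes("some.pdf"))` (the file name is a placeholder) should give one tuple per word. The boxes are in PDF points with the origin at the bottom-left of the page, so they need flipping if you want image-style coordinates.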

I've used Tesseract[2] to convert scanned documents into searchable PDF files. Since searching the resulting PDF highlights the matching words on the scanned page, Tesseract clearly knows where each word is and which letters make it up. This might be another approach.
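
If you only need the word positions and not the searchable PDF, one option is Tesseract's TSV output instead. This is a sketch under some assumptions: the `tesseract` binary is on your PATH, your version supports the `tsv` output config (newer releases do, as far as I know), and the column names match what I have seen it emit. `parse_tesseract_tsv` and `ocr_words` are hypothetical helper names.

```python
import csv
import io
import subprocess


def parse_tesseract_tsv(tsv_text):
    """Return one dict per recognized word from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; lower levels are pages, blocks, paragraphs, lines.
        if row.get("level") != "5" or not text:
            continue
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),
        })
    return words


def ocr_words(image_path):
    """Run the tesseract CLI on an image and return word boxes in pixels."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "tsv"],
        capture_output=True, text=True, check=True,
    )
    return parse_tesseract_tsv(result.stdout)
```

Here `left` and `top` are pixel offsets from the top-left corner of the image, which is the opposite vertical convention from PDF coordinates.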

[1] https://code.google.com/p/tesseract-ocr/wiki/ReadMe

[2] https://code.google.com/p/tesseract-ocr/wiki/ReadMe