Hacker News new | ask | show | jobs
by kelvin0 2298 days ago
Well I am putting the finishing touches on a front end that allows extracting PDF text visually. It's also able to adjust when the PDF page size vary for a given document type. Once you build the extractor for a document type, it can run on a batch of PDFs and store to Excel or Database (or any other format). I sense this tool facilitates and automates a lot of the 'dark art' you mention. Of course there are always difficult documents that don't fit exactly in the initial extraction paradigm, for those I use the big guns ...