by pvitz 1152 days ago
I would like to extract text from approximately 2000 PDF files (machine-generated, not scanned) whose layout varies from file to file. Some have normal paragraphs, others have two or even three columns. All contain tables, but I am not interested in those. Do you know a good (semi-)automatic solution for this?
1 comment

This is a hard problem and will unfortunately require an enterprise solution. If it's only 2000 PDFs, you might be better off outsourcing it to an offshore consulting agency to do it manually.
Thanks for the reply, good to know that!