HN Mirror



	by pvitz 1152 days ago
	I would like to extract text from approximately 2000 PDF files (machine generated, not scanned) whose layout varies from file to file. Some have normal paragraphs, others have two or even three columns. All contain tables, but I am not interested in those. Do you know a good (semi-)automatic solution for this?
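
	To make the multi-column case concrete, here is a minimal sketch of the reordering step, not a full solution. It assumes some layout-aware extractor has already produced text blocks as (x0, y0, x1, y1, text) tuples with the origin at the top-left of the page (PyMuPDF's `page.get_text("blocks")` returns tuples that begin with these five fields). The function name `reading_order` and the column heuristic are illustrative assumptions. It does not filter out tables, and a full-width block such as a page title would merge the columns it spans.

```python
from typing import List, Tuple

# Assumed block shape: (x0, y0, x1, y1, text), origin at top-left, y grows downward.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block]) -> List[str]:
    """Return block texts column by column, top to bottom within each column.

    Heuristic (an assumption, not a general layout algorithm): walking the
    blocks from left to right, a block starts a new column when its left
    edge lies at or beyond the right edge of the current column.
    """
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b[0]):
        if columns and block[0] < max(b[2] for b in columns[-1]):
            # Horizontally overlaps the current column: same column.
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: List[str] = []
    for column in columns:
        # Within a column, read from top to bottom.
        for block in sorted(column, key=lambda b: b[1]):
            ordered.append(block[4].strip())
    return ordered


if __name__ == "__main__":
    sample = [
        (300, 50, 550, 100, "right top"),
        (50, 200, 280, 250, "left bottom"),
        (50, 50, 280, 100, "left top"),
        (300, 200, 550, 250, "right bottom"),
    ]
    for text in reading_order(sample):
        print(text)
```

	The same function handles one, two, or three columns without being told the count, which matters when the layout differs per file; the price is that it trusts the block boxes to be clean.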

1 comments

exhibitapp 1151 days ago

This is a hard problem and will require an enterprise solution, unfortunately. If it's only 2000 PDFs, you might be better off outsourcing it to an offshore consulting agency to do it manually.


pvitz 1151 days ago

Thanks for the reply, good to know that!
