by joosters
3261 days ago
Very interesting! I had never heard of Apache PDFBox before; I must give it a try. I have a similar program that parses horse racing PDFs from sites such as www.racehorserunner.com. These are of a much simpler format, but they cause endless problems for me when the PDFs have layout problems, such as one column being too long and overlapping with another (e.g. the last race on http://www.racehorserunner.com/Archives/ELP/ELP170702.pdf). All the PDF parsers that I have tried cope very badly with these kinds of situations, and often try to be 'too clever' in that they value the final layout of the text over and above the individual strings. Have you experienced similar problems with PDFBox, or does it handle formatting and layout fairly reliably?
https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...