by lovelearning 3620 days ago
The file format itself has all the information required to extract text from a rectangular area. Frameworks like PDFBox and iText have supported this for a long time.
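
To make the idea concrete, here is a rough sketch of position-based extraction. It deliberately does not use the PDFBox or iText APIs (PDFBox, for example, ships `PDFTextStripperByArea` for this); `Glyph` and `RegionText` are hypothetical names, and the top-left origin with coordinates in points is an assumption. The point is only that once every text run carries a position, pulling text out of a rectangle is a filter plus a sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one positioned text run on a page.
// Assumption: origin at the top-left corner, units in points.
final class Glyph {
    final double x, y;
    final String text;

    Glyph(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class RegionText {
    // Returns the text of every run whose anchor point lies inside the
    // rectangle (left, top, width, height), in reading order:
    // top to bottom, then left to right.
    static String extract(List<Glyph> page, double left, double top,
                          double width, double height) {
        List<Glyph> hits = new ArrayList<>();
        for (Glyph g : page) {
            boolean insideX = g.x >= left && g.x < left + width;
            boolean insideY = g.y >= top && g.y < top + height;
            if (insideX && insideY) {
                hits.add(g);
            }
        }
        hits.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));

        StringBuilder sb = new StringBuilder();
        for (Glyph g : hits) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

A real library also has to deal with font metrics, rotated text and runs that straddle the region border, none of which this sketch attempts.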

It's up to users to define what the rows and columns are. In most programmatically generated PDFs, this is easy. But manually typeset PDFs have lots of edge cases, such as variable row heights or column widths and slanted table borders.
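
As an illustration of the easy case, the sketch below turns positioned words into table rows when the user supplies the column boundaries. All names (`Word`, `TableSketch`, `toRows`, `columnOf`) are hypothetical, and the row tolerance is an assumed knob, not something any of the libraries mentioned necessarily exposes in this form. It assumes straight, axis-aligned columns, which is exactly what breaks down for the manually typeset edge cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical positioned word; y grows downward (assumption).
final class Word {
    final double x, y;
    final String text;

    Word(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class TableSketch {
    // Index of the column containing x, given the user-defined right
    // edges of every column except the last, in ascending order.
    static int columnOf(double x, double[] rightEdges) {
        for (int i = 0; i < rightEdges.length; i++) {
            if (x < rightEdges[i]) {
                return i;
            }
        }
        return rightEdges.length; // to the right of every edge: last column
    }

    // Groups words into rows (a word stays in the current row while its y
    // is within rowTolerance of that row's first word), then places each
    // word into a cell by its x position.
    static List<List<String>> toRows(List<Word> words, double[] rightEdges,
                                     double rowTolerance) {
        List<Word> byY = new ArrayList<>(words);
        byY.sort(Comparator.comparingDouble((Word w) -> w.y));

        List<List<String>> rows = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        for (Word w : byY) {
            if (!current.isEmpty() && w.y - current.get(0).y > rowTolerance) {
                rows.add(toCells(current, rightEdges));
                current = new ArrayList<>();
            }
            current.add(w);
        }
        if (!current.isEmpty()) {
            rows.add(toCells(current, rightEdges));
        }
        return rows;
    }

    // Builds one row: words are ordered left to right, and words that land
    // in the same column are joined with a space.
    private static List<String> toCells(List<Word> row, double[] rightEdges) {
        row.sort(Comparator.comparingDouble((Word w) -> w.x));
        String[] cells = new String[rightEdges.length + 1];
        Arrays.fill(cells, "");
        for (Word w : row) {
            int c = columnOf(w.x, rightEdges);
            cells[c] = cells[c].isEmpty() ? w.text : cells[c] + " " + w.text;
        }
        return Arrays.asList(cells);
    }
}
```

Variable row heights can be absorbed to some extent by the tolerance, but slanted borders or columns whose width changes down the page would need per-row boundaries rather than one fixed set of edges.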

1 comment

That's right! The user defines a rectangular area and we then extract the raw text based on the position. For table extraction we use tabula.java under the hood.