by UglyToad 2340 days ago
I had to check we hadn't worked for the same company! Yeah, text extraction and layout analysis from PDFs is a super interesting challenge and still relatively underdeveloped. I'd rank table detection as about the hardest problem in that field.

One of the contributors to the PDF library I'm developing has been implementing some interesting algorithms for layout analysis https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...


Really really interesting, hadn't seen pdfpig before!

In the delicious pics of results, it looks like the bullets are treated as one column and the paragraphs for each bullet point run together as a single chunk of text. Is that right?

What do you think about tackling bullets and indents?

Thanks! I think there's definitely room for rules-based enhancements to the underlying algorithms.

My area of work on the project has been the core file-reading and file-creation stuff so I haven't had much of a chance to review the layout algorithm performance across documents.

I've been working on a purely rules-based approach in a private repository for a side project. Compared with starting from rules alone, the algorithms the contributor has implemented get you a lot closer to the correct result, but it definitely feels like adding some context-aware rules would get you all the way there. I'm not sure whether those rules would be in scope for the layout analysis project itself, or whether someone could take the open core and extend it, as I was attempting in my side project.

It depends. We do commercial PDF and scanned-document information extraction, as well as table detection for line items on invoices, receipts and remittance slips. We have been successfully using a rule-based system for years but are mixing in deep learning now. I also know of about 5 other companies competing in the same field. So, I wouldn't say it is underdeveloped.
Referring to the above poster's "non-locality", are we talking about denormalization of formatting? Is there a way to "normalize" PDF structure? Calculate margins or common formats beforehand to normalize?
I believe the reference is to logical locality, specifically in the case of PDF that transforms and such are essentially atomic and there's no real boundary layer in which you may say "transform X and transform Y are equivalent within this local finite domain."

A PDF makes no real distinction between formatting and content, so it's not possible to truly separate them.

I'll try my best to answer but I may be misunderstanding the question.

The current layout analysis algorithms don't do much normalization as far as I'm aware. The Recursive-XY Cut algorithm uses page-level font-size information [0] to tune parameters, but it doesn't infer a common structure or format either as an input or as a result.
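For anyone unfamiliar with the idea, here is a minimal Python sketch of a recursive XY cut. It is not PdfPig's implementation (that one is C# and linked at [0]); the function names and the (x0, y0, x1, y1) box layout are assumptions made for illustration. The two minimum-gap parameters are the kind of thing you might derive from page-level font size, for example some multiple of the dominant font size, but that multiple is an assumption too.

```python
def find_gap(boxes, axis, min_gap):
    """Return a cut coordinate in the widest empty gap along `axis`
    (0 = x, 1 = y), or None if no gap is at least `min_gap` wide."""
    intervals = sorted((b[axis], b[axis + 2]) for b in boxes)
    best_width, best_mid = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_mid = gap, (start + reach) / 2
        reach = max(reach, end)
    return best_mid


def xy_cut(boxes, min_gap_x, min_gap_y):
    """Recursively split boxes (x0, y0, x1, y1) into blocks that are
    separated by whitespace, trying a vertical cut before a horizontal one."""
    if len(boxes) < 2:
        return [boxes] if boxes else []
    for axis, min_gap in ((0, min_gap_x), (1, min_gap_y)):
        cut = find_gap(boxes, axis, min_gap)
        if cut is not None:
            # The cut lies in empty space, so every box falls on one side.
            low = [b for b in boxes if b[axis + 2] <= cut]
            high = [b for b in boxes if b[axis] >= cut]
            return (xy_cut(low, min_gap_x, min_gap_y)
                    + xy_cut(high, min_gap_x, min_gap_y))
    return [boxes]


# Two made-up columns of two words each.
left = [(0, 0, 40, 10), (0, 20, 40, 30)]
right = [(100, 0, 140, 10), (100, 20, 140, 30)]
blocks = xy_cut(left + right, min_gap_x=20, min_gap_y=15)
```

With those gaps the sketch returns the two columns as two blocks. Lowering min_gap_y to 5 also splits each column into its two lines, which is why tuning the gaps to the page's font size matters.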

The aim of most layout analysis algorithms is to produce classifications for regions, e.g. paragraphs, titles, lists which I suppose counts as denormalizing the document? Arriving at those classifications generally relies on first splitting the document into sections or regions and then classifying those regions. So far the implemented algorithms mainly focus on the first step, splitting a document into discrete regions. An example of the second step using ML approaches to classify those regions by the same contributor can be found here [1].

With the rule-based approaches I've been experimenting with, you can use information from the common PDF producers to normalize certain features. For example, line spacing and font size have a well-defined relationship, as do whitespace size and font size (though this is a fuzzier relationship and goes out the window entirely for justified text).
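As a rough illustration of the line spacing point, here is a small Python sketch that divides each baseline-to-baseline gap by the font size, so the same rule works for 8pt and 14pt text. The Line class, the function names and the 1.3 tolerance are all assumptions for the example, not anything taken from PdfPig or from my side project.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    baseline_y: float  # baseline in points; y grows upward, as in PDF space
    font_size: float   # dominant font size of the line, in points


def normalized_gaps(lines):
    """Baseline-to-baseline gaps, top to bottom, divided by font size."""
    ordered = sorted(lines, key=lambda line: -line.baseline_y)
    return [
        (upper.baseline_y - lower.baseline_y) / lower.font_size
        for upper, lower in zip(ordered, ordered[1:])
    ]


def paragraph_breaks(lines, tolerance=1.3):
    """Indices (in top-to-bottom order) of lines that start a new paragraph,
    judged by a gap clearly larger than the page's typical leading."""
    gaps = normalized_gaps(lines)
    if not gaps:
        return []
    typical = median(gaps)
    return [i + 1 for i, gap in enumerate(gaps) if gap > typical * tolerance]


# Made-up page: 10pt text with 12pt leading and one double-spaced gap.
page = [Line(y, 10.0) for y in (700, 688, 676, 652, 640)]
breaks = paragraph_breaks(page)
```

Here the fourth line (index 3) is flagged as the start of a new paragraph, because its normalized gap is 2.4 against a typical 1.2.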

Here's an example where you rely on non-locality to parse a document. This SEC filing contains both key-value pairs and a table: https://www.sec.gov/Archives/edgar/data/1428796/000110465920...

For the values following the subheading "Institutional Investment Manager Filing this Report:", the left-hand column holds the keys for the right-hand values.

At the bottom of the document there's a table containing the columns "Form 13F File Number" and "Name".

Now you could use a couple of rules to infer the difference between the key-values and the table:

1) The keys in a key value list end in ':'.

2) The keys in a key value list have a different font/color to the values.

Both of those rules hold true here, but not in all or even most documents. For this reason you need to use the whole page to deduce the type of these sections, rather than just the immediately surrounding features/pixels as an ML algorithm might.
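To make the two rules concrete, here is a minimal Python sketch that scores a whole two-column block instead of judging row by row. The Row class, the function name and the 0.8 threshold are assumptions for illustration, and the sample rows are made up rather than copied from the filing. As noted above, rules like these fail on plenty of documents, so treat the result as one piece of page-level evidence and not as a classifier.

```python
from dataclasses import dataclass


@dataclass
class Row:
    left: str         # text of the left-hand cell
    right: str        # text of the right-hand cell
    left_font: str    # font name (or font + color) of the left cell
    right_font: str   # font name (or font + color) of the right cell


def looks_like_key_value(rows, min_share=0.8):
    """True if most rows satisfy rule 1 (left cell ends in ':') or
    most rows satisfy rule 2 (left and right cells use different fonts)."""
    if not rows:
        return False
    colon_share = sum(r.left.rstrip().endswith(":") for r in rows) / len(rows)
    font_share = sum(r.left_font != r.right_font for r in rows) / len(rows)
    return colon_share >= min_share or font_share >= min_share


# Made-up blocks: a key-value list and a plain two-column table.
key_values = [
    Row("Name:", "Example Capital LP", "Helvetica-Bold", "Helvetica"),
    Row("Address:", "1 Main Street", "Helvetica-Bold", "Helvetica"),
]
table = [
    Row("000-00001", "Example Partners", "Helvetica", "Helvetica"),
    Row("000-00002", "Example Advisors", "Helvetica", "Helvetica"),
]
```

looks_like_key_value(key_values) is True and looks_like_key_value(table) is False for these samples. On a real page you would compare the scores of all candidate blocks before deciding what each one is.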

[0]: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad....

[1]: https://github.com/BobLd/PdfPigMLNetBlockClassifier