by coffeecat 2561 days ago
I'm curious whether you considered or tried using pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, since picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, do you mean runs of non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be at picking out words/tokens?
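To make the suggestion concrete, here is a rough sketch of the split I have in mind. It assumes word dicts shaped like the ones pdfplumber's extract_words() returns (x0, x1, top, bottom, text) and table bounding boxes as (x0, top, x1, bottom) tuples, such as the bbox of each table from page.find_tables(). The helper name split_by_tables and the file name in the usage comment are made up for illustration, and testing each word's center point is just one reasonable choice.

```python
def split_by_tables(words, table_bboxes):
    """Partition word dicts into (inside_tables, outside_tables).

    words: dicts with "x0", "x1", "top", "bottom" keys.
    table_bboxes: (x0, top, x1, bottom) tuples.
    A word counts as tabular if its center falls inside any table box.
    """
    def center_in(word, bbox):
        cx = (word["x0"] + word["x1"]) / 2
        cy = (word["top"] + word["bottom"]) / 2
        x0, top, x1, bottom = bbox
        return x0 <= cx <= x1 and top <= cy <= bottom

    inside, outside = [], []
    for word in words:
        if any(center_in(word, bbox) for bbox in table_bboxes):
            inside.append(word)
        else:
            outside.append(word)
    return inside, outside


# Hypothetical usage with pdfplumber (the file name is an assumption):
# import pdfplumber
# with pdfplumber.open("invoice.pdf") as pdf:
#     page = pdf.pages[0]
#     boxes = [table.bbox for table in page.find_tables()]
#     tabular, other = split_by_tables(page.extract_words(), boxes)
```

Whether this helps depends on how often the fields you want actually sit inside a detected table.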

I didn’t try separating out tables because the total field isn’t actually “inside” the table in many cases. Certainly the other fields I want are not.

pdfplumber seems mostly ok at extracting tokens. Sometimes it seems to combine tokens that should be separate. I suspect a few percent of the error actually comes from problems earlier in the data pipeline, rather than from the model proper.
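
For anyone wondering how tokens can end up combined, here is a toy model of gap-based word grouping. It is a simplified illustration, not pdfplumber's actual code: characters on one line are joined left to right, and a new token starts only when the horizontal gap exceeds a tolerance. pdfplumber's extract_words() does take an x_tolerance argument in the same spirit, so lowering it may be worth trying when tokens get merged. The function name group_chars and the sample coordinates are made up.

```python
def group_chars(chars, x_tolerance=3):
    """Join single-line character dicts into tokens.

    chars: dicts with "text", "x0", "x1" keys.
    A new token starts at whitespace or when the gap between
    consecutive characters is larger than x_tolerance.
    """
    tokens, current, prev_x1 = [], "", None
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if ch["text"].isspace():
            if current:
                tokens.append(current)
            current, prev_x1 = "", None
            continue
        if prev_x1 is not None and ch["x0"] - prev_x1 > x_tolerance:
            tokens.append(current)
            current = ""
        current += ch["text"]
        prev_x1 = ch["x1"]
    if current:
        tokens.append(current)
    return tokens


# "12" and "US" set close together, with no space character between them.
line = [
    {"text": "1", "x0": 0, "x1": 5},
    {"text": "2", "x0": 5, "x1": 10},
    {"text": "U", "x0": 12, "x1": 17},
    {"text": "S", "x0": 17, "x1": 22},
]
print(group_chars(line, x_tolerance=3))  # the 2-unit gap is within tolerance
print(group_chars(line, x_tolerance=1))  # the 2-unit gap now splits the token
```

With the looser tolerance the amount and the currency code come out as one token, which is the kind of merge I mean.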