by coffeecat
2561 days ago
I'm curious whether you considered or tried pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, since picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, do you mean runs of non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be at picking out words/tokens?
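To make the suggestion concrete, here is a minimal sketch of splitting a page's words into tabular and non-tabular sets. It assumes pdfplumber's `page.find_tables()` (each table exposing a `bbox`) and `page.extract_words()` (dicts with `text`, `x0`, `x1`, `top`, `bottom`). The helper names `in_bbox`, `split_words`, and `split_page` are hypothetical, and the centre-point test is just one reasonable choice, not something from the original project.

```python
from typing import Dict, List, Sequence, Tuple

# (x0, top, x1, bottom), the coordinate order pdfplumber uses for bounding boxes
BBox = Tuple[float, float, float, float]


def in_bbox(word: Dict, bbox: BBox) -> bool:
    """Return True if the word's centre point lies inside bbox."""
    x0, top, x1, bottom = bbox
    cx = (word["x0"] + word["x1"]) / 2
    cy = (word["top"] + word["bottom"]) / 2
    return x0 <= cx <= x1 and top <= cy <= bottom


def split_words(
    words: Sequence[Dict], table_bboxes: Sequence[BBox]
) -> Tuple[List[Dict], List[Dict]]:
    """Partition words into (inside some table, outside every table)."""
    tabular: List[Dict] = []
    other: List[Dict] = []
    for word in words:
        if any(in_bbox(word, bbox) for bbox in table_bboxes):
            tabular.append(word)
        else:
            other.append(word)
    return tabular, other


def split_page(path: str, page_number: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Apply split_words to one page of a PDF (requires pdfplumber)."""
    import pdfplumber  # third-party; imported lazily so the helpers above run without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        words = page.extract_words()
        table_bboxes = [table.bbox for table in page.find_tables()]
    return split_words(words, table_bboxes)
```

Whether this helps depends on how well table detection works on the documents in question; forms without ruled lines may need non-default table settings.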
pdfplumber seems mostly OK at extracting tokens, though it sometimes combines tokens that should be separate. I suspect a few percent of the error actually comes from problems earlier in the data pipeline, rather than from the model proper.
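One plausible cause of merged tokens is the horizontal gap tolerance used when characters are grouped into words. The sketch below is a simplified model of that idea, not pdfplumber's actual implementation: `group_chars` is a hypothetical helper, and the character boxes are made up for illustration. pdfplumber's own `extract_words()` does accept an `x_tolerance` argument, so lowering it may be worth trying on pages where tokens run together.

```python
from typing import Dict, List, Sequence


def group_chars(chars: Sequence[Dict], x_tolerance: float = 3.0) -> List[str]:
    """Group single-line characters into words by horizontal gap.

    A new word starts at whitespace, or when the gap between one
    character's right edge and the next one's left edge exceeds
    x_tolerance.
    """
    words: List[str] = []
    current = ""
    prev_x1 = 0.0
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if ch["text"].isspace():
            # Explicit whitespace always ends the current word.
            if current:
                words.append(current)
            current = ""
            continue
        if current and ch["x0"] - prev_x1 > x_tolerance:
            # Gap too wide to belong to the same word.
            words.append(current)
            current = ""
        current += ch["text"]
        prev_x1 = ch["x1"]
    if current:
        words.append(current)
    return words


# Two numbers set close together, with a 2-unit gap and no space character.
chars = [
    {"text": "1", "x0": 0, "x1": 5},
    {"text": "2", "x0": 5, "x1": 10},
    {"text": "3", "x0": 12, "x1": 17},
    {"text": "4", "x0": 17, "x1": 22},
]
merged = group_chars(chars, x_tolerance=3.0)  # gap of 2 is within tolerance
split = group_chars(chars, x_tolerance=1.0)   # gap of 2 now breaks the word
```

With these made-up boxes, the looser tolerance yields one token and the tighter one yields two, which is the kind of merge that would show up downstream as model error.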