by jonathanstray 2561 days ago
Heh. It’s my work, so maybe I can clarify the goals. This is meant to be a proof of concept. I took a week and was able to show that relatively simple deep learning techniques are capable of generalizing over unseen form types with high accuracy. I also showed that tokens-plus-geometry is a viable format, and that hand-crafted feature engineering is still necessary (and still used in SOTA approaches).

I also believe that preparing and cleaning this data set, and bringing a challenging investigative journalism problem to the attention of other researchers, would be valuable even if I hadn’t done any work on this baseline solution. This is a problem that journalists currently expend a huge amount of time and money on, which reduces the effectiveness of transparency around political ad spending information.

I'm curious whether you considered or tried using pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, as picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, do you mean runs of non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be at picking out words/tokens?
I didn’t try separating out tables because the total field isn’t actually “inside” the table in many cases. Certainly the other fields I want are not.

pdfplumber seems mostly ok at extracting tokens. Sometimes it seems to combine tokens that should be separate. I suspect a few percent of the error actually comes from problems earlier in the data pipeline, as opposed to the model proper.
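
For readers wondering what "tokens plus geometry" could look like in practice, here is a rough sketch. It is not the code from the project. It assumes word dicts in the shape pdfplumber's extract_words() returns (text, x0, x1, top, bottom); the Token class, the to_tokens helper, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One extracted word plus its position, scaled to the 0..1 range."""
    text: str
    x0: float      # left edge as a fraction of page width
    x1: float      # right edge as a fraction of page width
    top: float     # top edge as a fraction of page height
    bottom: float  # bottom edge as a fraction of page height


def to_tokens(words, page_width, page_height):
    """Convert word dicts into tokens with page-normalized geometry.

    Normalizing by page size keeps coordinates comparable across
    forms scanned or rendered at different sizes.
    """
    return [
        Token(
            text=w["text"],
            x0=w["x0"] / page_width,
            x1=w["x1"] / page_width,
            top=w["top"] / page_height,
            bottom=w["bottom"] / page_height,
        )
        for w in words
    ]


# Hypothetical sample data. With a real PDF the list would come from
# something like page.extract_words(), with page.width and page.height
# supplying the page size.
sample_words = [
    {"text": "Total", "x0": 306.0, "x1": 340.0, "top": 700.0, "bottom": 712.0},
    {"text": "$1,250.00", "x0": 459.0, "x1": 520.0, "top": 700.0, "bottom": 712.0},
]

tokens = to_tokens(sample_words, page_width=612.0, page_height=792.0)
for t in tokens:
    print(t)
```

Keeping every token with its coordinates, rather than only the table rows, means a field that sits outside any table is still a candidate for the model.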

Really enjoyed the video, thank you for sharing.

I think the parent's criticism was more that the article was a little light compared to the video. For example, you didn't have screenshots of the scanned PDFs in the article.