| HN Mirror



	by jonathanstray 2561 days ago
	Heh. It’s my work, so maybe I can clarify the goals. This is meant to be a proof of concept. I took a week and was able to show that relatively simple deep learning techniques are capable of generalizing over unseen form types with high accuracy. I also showed that tokens-plus-geometry is a viable format, and that hand-crafted feature engineering is still necessary (and still used in SOTA approaches). I also believe that preparing and cleaning this data set, and bringing a challenging investigative journalism problem to the attention of other researchers, would be valuable even if I hadn’t done any work on this baseline solution. This is a problem that journalists currently expend a huge amount of time and money on, which reduces the effectiveness of transparency around political ad spending information.

2 comments

coffeecat 2561 days ago

I'm curious if you considered or attempted to use pdfplumber's table extraction methods to separate tabular from non-tabular text. That would be my starting point on a problem like this, as picking the relevant rows of a table is far easier than picking from the set of all tokens. By the way, when you say tokens, you're referring to non-whitespace characters separated by whitespace? How reliable have you found pdfplumber to be in picking out words/tokens?


jonathanstray 2561 days ago

I didn’t try separating out tables because the total field isn’t actually “inside” the table in many cases. Certainly the other fields I want are not.

pdfplumber seems mostly ok at extracting tokens. Sometimes it seems to combine tokens that should be separate. I suspect a few percent of the error is actually problems earlier in the data pipeline, as opposed to the model proper.
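To illustrate how tokens can end up combined, here is a minimal sketch of gap-based character grouping. It is not pdfplumber's actual code. It only assumes the general idea that characters are merged into one token when the horizontal gap between them is small. pdfplumber's `extract_words` does expose an `x_tolerance` parameter for this purpose. The helper names `make_chars` and `group_chars_into_tokens`, the fixed character width, and the coordinates are all hypothetical.

```python
def make_chars(text, start_x, width=5.0):
    """Build fake character boxes for one text line (hypothetical layout)."""
    return [
        {"text": c, "x0": start_x + i * width, "x1": start_x + (i + 1) * width}
        for i, c in enumerate(text)
    ]


def group_chars_into_tokens(chars, x_tolerance=3.0):
    """Merge characters on a single line into tokens with geometry.

    A character joins the current token when the gap to the previous
    character is at most x_tolerance; whitespace always ends a token.
    """
    tokens = []
    current = None
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if ch["text"].isspace():
            current = None  # explicit whitespace closes the token
            continue
        if current is not None and ch["x0"] - current["x1"] <= x_tolerance:
            current["text"] += ch["text"]
            current["x1"] = ch["x1"]
        else:
            current = {"text": ch["text"], "x0": ch["x0"], "x1": ch["x1"]}
            tokens.append(current)
    return tokens


# "Total" ends at x=25 and "$500" starts at x=27, with no space character
# between them, as can happen when a form positions text by coordinates.
line = make_chars("Total", 0.0) + make_chars("$500", 27.0)

merged = group_chars_into_tokens(line)                     # gap of 2 <= 3
split = group_chars_into_tokens(line, x_tolerance=1.0)     # gap of 2 > 1
print([t["text"] for t in merged])
print([t["text"] for t in split])
```

With the looser tolerance the label and the amount collapse into a single token. A tighter tolerance separates them, at the risk of splitting words whose letters are widely spaced.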


steve19 2561 days ago

Really enjoyed the video, thank you for sharing.

I think the parent's criticism was more that the article was a little light compared to the video. For example, you didn't have screenshots of the scanned PDFs in the article.
