by _rerr 2299 days ago
There is a fairly interesting library developed by the Stanford team behind https://www.snorkel.org/ that takes structured documents, including PDFs formatted as tables, and builds a knowledge base: https://github.com/HazyResearch/fonduer

It looks promising for these kinds of daunting tasks.


One of the co-authors of Fonduer here. Just for reference, the original paper for Fonduer is here:

https://dl.acm.org/doi/pdf/10.1145/3183713.3183729

And additional follow-up work on extracting data from PDF datasheets is here:

https://dl.acm.org/doi/pdf/10.1145/3316482.3326344

One thing to point out about our library is that while we do take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the PDF for structural cues. In our pipeline this is typically done by using Adobe Acrobat to generate an HTML representation for each input PDF.

What type of visual features are you looking at? I've been trying to find a web clipper that uses both visual and structural cues from the rendered page and HTML, but have had no luck finding a good starting point.

There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, etc. You can see more in the code at [1]. In general, visual features seem to give some nice redundancy to some of the structural features of HTML, which helps when dealing with an input as noisy as PDF.

[1]: https://github.com/HazyResearch/fonduer/tree/master/src/fond...
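
To make the alignment idea a bit more concrete, here is a rough sketch of how bounding boxes could be turned into pairwise visual features. This is not Fonduer's actual code or API: the `Box` class, the function names, and the 2-unit tolerance are all hypothetical, and it assumes page coordinates where `top` is smaller than `bottom`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a text span, in page coordinates."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


def same_page(a: Box, b: Box) -> bool:
    """True if both spans were rendered on the same page."""
    return a.page == b.page


def horizontally_aligned(a: Box, b: Box) -> bool:
    """True if the spans sit in the same row: same page, overlapping vertical extents."""
    return same_page(a, b) and a.top < b.bottom and b.top < a.bottom


def vertically_aligned(a: Box, b: Box, tol: float = 2.0) -> bool:
    """True if the spans look like one column: same page, left edges within `tol`."""
    return same_page(a, b) and abs(a.left - b.left) <= tol


def visual_features(a: Box, b: Box) -> dict:
    """Boolean visual features for a pair of spans (illustrative feature names)."""
    return {
        "same_page": same_page(a, b),
        "horz_aligned": horizontally_aligned(a, b),
        "vert_aligned": vertically_aligned(a, b),
    }


# Made-up coordinates: a label, a value in the same row, and a cell below the label.
label = Box(page=1, left=50.0, top=100.0, right=120.0, bottom=112.0)
value = Box(page=1, left=300.0, top=101.0, right=340.0, bottom=113.0)
below = Box(page=1, left=50.5, top=130.0, right=110.0, bottom=142.0)

print(visual_features(label, value))
print(visual_features(label, below))
```

Features like these can fire even when the HTML structure is mangled (e.g. a table row split across unrelated tags), which is the kind of redundancy described above.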