| HN Mirror



	by _rerr 2299 days ago
	There is a fairly interesting library developed by the Stanford team behind https://www.snorkel.org/ that takes structured documents, including PDFs formatted as tables, and builds a knowledge base (https://github.com/HazyResearch/fonduer). It looks promising for these kinds of daunting tasks.


lwhsiao 2298 days ago

One of the co-authors of Fonduer here. Just for reference, the original paper for Fonduer is here:

https://dl.acm.org/doi/pdf/10.1145/3183713.3183729

And additional follow-up work on extracting data from PDF datasheets is here:

https://dl.acm.org/doi/pdf/10.1145/3316482.3326344

One thing to point out about our library is that while we do take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the PDF for structural cues. In our pipeline this is typically done by using Adobe Acrobat to generate an HTML representation for each input PDF.


bhl 2298 days ago

What types of visual features are you looking at? I've been trying to find a web clipper that uses both visual and structural cues from the rendered page and the HTML, but have had no luck finding a good starting point.


lwhsiao 2294 days ago

There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, and so on. You can see more in the code at [1]. In general, visual features seem to add some useful redundancy to the structural features of HTML, which helps when dealing with an input as noisy as PDF.

[1]: https://github.com/HazyResearch/fonduer/tree/master/src/fond...
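
To make the bounding-box idea concrete, here is a rough sketch of what alignment features between two spans could look like. This is not Fonduer's code or API: the `Span` type, the `alignment_features` function, the feature names, and the pixel tolerance are all hypothetical, chosen only for illustration. It assumes each span already comes with a page number and a bounding box in page coordinates, with the y axis growing downward.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Span:
    """Hypothetical text span with a page number and a bounding box."""
    text: str
    page: int
    left: float
    top: float
    right: float
    bottom: float


def alignment_features(a: Span, b: Span, tol: float = 2.0) -> Set[str]:
    """Return the names of the visual relations that hold between two spans.

    `tol` is an assumed tolerance (in page units) that absorbs small
    layout jitter in the extracted coordinates.
    """
    feats: Set[str] = set()
    if a.page != b.page:
        # Spans on different pages share no visual relation in this sketch.
        return feats
    feats.add("SAME_PAGE")
    if abs(a.left - b.left) <= tol:
        feats.add("LEFT_ALIGNED")
    if abs(a.right - b.right) <= tol:
        feats.add("RIGHT_ALIGNED")
    center_a = (a.left + a.right) / 2
    center_b = (b.left + b.right) / 2
    if abs(center_a - center_b) <= tol:
        feats.add("CENTER_ALIGNED")
    # The spans sit on the same row if their vertical extents overlap.
    if min(a.bottom, b.bottom) - max(a.top, b.top) > 0:
        feats.add("SAME_ROW")
    return feats


# A column header and a value below it in the same table column.
header = Span("Voltage", 1, 100.0, 50.0, 150.0, 62.0)
value = Span("3.3 V", 1, 100.5, 80.0, 135.0, 92.0)
print(sorted(alignment_features(header, value)))  # ['LEFT_ALIGNED', 'SAME_PAGE']
```

Features like these can agree with what the HTML structure says (for example, two cells in the same table column), which is the kind of redundancy that helps when one of the two signals is noisy.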
