by lwhsiao 2305 days ago

One of the co-authors of Fonduer here. For reference, the original Fonduer paper is here:

https://dl.acm.org/doi/pdf/10.1145/3183713.3183729

And additional follow-up work on extracting data from PDF datasheets is here:

https://dl.acm.org/doi/pdf/10.1145/3316482.3326344

One thing to point out about our library: while we do take PDFs as input and use them to calculate visual features, we also rely on an HTML representation of each PDF for structural cues. In our pipeline, we typically use Adobe Acrobat to generate that HTML representation for each input PDF.

bhl 2305 days ago

What types of visual features are you looking at? I've been trying to find a web clipper that uses both visual and structural cues from the rendered page and the HTML, but have had no luck finding a good starting point.

lwhsiao 2300 days ago

There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, etc. You can see more in the code at [1]. In general, visual features seem to give some nice redundancy to some of the structural features of HTML, which helps when dealing with an input as noisy as PDF.

[1]: https://github.com/HazyResearch/fonduer/tree/master/src/fond...
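
For anyone looking for a starting point, here is a minimal sketch of what bounding-box alignment features of the kind described above might look like. This is not Fonduer's actual code or API: the `Span` class, the `visual_features` function, the feature names, and the 2-unit tolerance are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Span:
    """Hypothetical text span with a page number and a bounding box.

    Coordinates are assumed to be in page units with the origin at the
    top-left corner, so `top` < `bottom`.
    """
    text: str
    page: int
    left: float
    top: float
    right: float
    bottom: float


def visual_features(a: Span, b: Span, tol: float = 2.0) -> Dict[str, bool]:
    """Return boolean alignment features for a pair of spans.

    Two edges count as aligned when they are on the same page and their
    coordinates differ by at most `tol` (an assumed tolerance).
    """
    same_page = a.page == b.page

    def close(x: float, y: float) -> bool:
        return same_page and abs(x - y) <= tol

    return {
        "same_page": same_page,
        "left_aligned": close(a.left, b.left),
        "right_aligned": close(a.right, b.right),
        "top_aligned": close(a.top, b.top),
        "bottom_aligned": close(a.bottom, b.bottom),
        # Horizontal centers match, e.g. cells stacked in one table column.
        "center_aligned": close((a.left + a.right) / 2, (b.left + b.right) / 2),
    }


# Example: a label and a value that sit on the same row of a datasheet table.
label = Span("Vce", page=1, left=50, top=100, right=80, bottom=112)
value = Span("40 V", page=1, left=200, top=100.5, right=230, bottom=112.5)
features = visual_features(label, value)
```

In this example the two spans share a page and a row (`top_aligned` and `bottom_aligned` hold) but not a column. Features like these could be computed on a rendered web page as well as on a PDF, as long as each text node has a bounding box.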
