One thing to point out about our library: while we take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the same PDF for structural cues. In our pipeline we typically use Adobe Acrobat to generate that HTML for each input PDF.
What type of visual features are you looking at? I've been trying to find a web clipper that uses both visual and structural cues from the rendered page and the HTML, but have had no luck finding a good starting point.
There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, etc. You can see more in the code at [1]. In general, visual features seem to add some useful redundancy to the structural features from the HTML, which helps when dealing with an input as noisy as PDF.
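To give a rough idea of what bounding-box features like these can look like, here is a minimal sketch. It is not the library's actual API: the Span type, the visual_features function, the feature names, and the 2-unit tolerance are all made up for illustration, and it assumes each span already comes with a page number and a box in page coordinates with the origin at the top left.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A text span with its page and bounding box (hypothetical type)."""
    text: str
    page: int
    left: float
    top: float
    right: float
    bottom: float


def visual_features(a: Span, b: Span, tol: float = 2.0) -> set[str]:
    """Return the names of the visual relations that hold between two spans.

    `tol` is an assumed tolerance, in the same units as the coordinates.
    """
    feats: set[str] = set()
    if a.page != b.page:
        # Coordinates on different pages are not comparable, so stop here.
        feats.add("DIFF_PAGE")
        return feats
    feats.add("SAME_PAGE")
    # Column-style alignment: shared left edge, right edge, or center line.
    if abs(a.left - b.left) <= tol:
        feats.add("LEFT_ALIGNED")
    if abs(a.right - b.right) <= tol:
        feats.add("RIGHT_ALIGNED")
    if abs((a.left + a.right) / 2 - (b.left + b.right) / 2) <= tol:
        feats.add("CENTER_ALIGNED")
    # Row-style alignment: the two boxes overlap along the vertical axis.
    if min(a.bottom, b.bottom) - max(a.top, b.top) > 0:
        feats.add("SAME_ROW")
    return feats


header = Span("Part", page=1, left=50, top=100, right=90, bottom=112)
cell = Span("BC547", page=1, left=50, top=130, right=110, bottom=142)
value = Span("45 V", page=1, left=200, top=101, right=240, bottom=113)
print(visual_features(header, cell))
print(visual_features(header, value))
```

In this toy example the header and the cell below it share a left edge, while the header and the value to its right overlap vertically, so each pair picks up a different alignment feature even if the HTML structure around them is broken.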
https://dl.acm.org/doi/pdf/10.1145/3183713.3183729
And additional follow-up work on extracting data from PDF datasheets is here:
https://dl.acm.org/doi/pdf/10.1145/3316482.3326344