
by spacecaps 444 days ago

I was irritated that I couldn't extract data from PDFs in a similar way to web pages + BeautifulSoup, so I built a library that (kind of) does just that[0]. It does a bunch of other nonsense, but the main goal is a more "human" way of interacting, e.g. `page.find('text:bold:contains("Summary")').below().extract_text()`.
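
To give a rough feel for what a chained query like that has to do, here is a toy sketch. It is not natural-pdf's implementation or API: the `Word`, `Region`, `Element`, and `Page` classes and the `find_bold_containing` method are hypothetical names invented for this example, and a "page" is assumed to be just a list of words with a vertical position and a bold flag.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a piece of text extracted from a PDF page."""
    text: str
    top: float          # distance from the top of the page
    bold: bool = False


class Region:
    """A bag of words kept in top-to-bottom reading order."""

    def __init__(self, words):
        # sorted() is stable, so words on the same line keep their input order
        self.words = sorted(words, key=lambda w: w.top)

    def extract_text(self):
        return " ".join(w.text for w in self.words)


class Element:
    """A single matched word that remembers which page it came from."""

    def __init__(self, page, word):
        self.page = page
        self.word = word

    def below(self):
        # Everything positioned strictly lower on the page than the match
        return Region([w for w in self.page.words if w.top > self.word.top])


class Page(Region):
    def find_bold_containing(self, needle):
        # First bold word whose text contains the needle, or None
        for w in self.words:
            if w.bold and needle in w.text:
                return Element(self, w)
        return None


# Made-up page contents, purely for illustration
page = Page([
    Word("Annual Report", top=10, bold=True),
    Word("Summary", top=50, bold=True),
    Word("Revenue", top=70),
    Word("rose", top=70),
    Word("Footer", top=90),
])
text = page.find_bold_containing("Summary").below().extract_text()
```

The point is only the shape of the interaction: find an anchor by how it looks and what it says, move spatially relative to it, then pull the text out. Real PDFs need far more care (columns, fonts, overlapping boxes) than this sketch accounts for.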

And since every PDF is its own bespoke nightmare, I'm also trying to build up a collection of awful-to-extract-data-from examples to serve as the foundation for a how-to library[1].

[0] https://jsoma.github.io/natural-pdf/

[1] https://badpdfs.com/