by spacecaps
398 days ago
I was irritated that I couldn't extract data from PDFs the way I can from web pages with BeautifulSoup, so I built a library that (kind of) does just that[0]. It does a bunch of other nonsense, but the main goal is a more "human" way of interacting, e.g. `page.find('text:bold:contains("Summary")').below().extract_text()`. And since every PDF is its own bespoke nightmare, I'm also trying to build up a collection of awful-to-extract-data-from examples to serve as the foundation for a how-to library[1].

[0] https://jsoma.github.io/natural-pdf/
[1] https://badpdfs.com/