by kwon-young
317 days ago
Just to illustrate this point, poppler [1] (which is the most popular PDF renderer in open source) has a little tool called pdftocairo [2] which can render a PDF into an SVG.
This means you can delegate all PDF rendering to poppler and work only with actual graphical objects to extract semantics. I think the reason this method is not popular is that there are still many ways to encode a semantic object graphically: a sentence can be broken down into words or letters, table lines can be formed from multiple smaller lines, etc.
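A minimal sketch of that workflow in Python, assuming the pdftocairo binary from poppler is installed and on your PATH. The function names are mine, not part of poppler; the only poppler pieces used are the `-svg`, `-f` and `-l` options. It renders one page to SVG, then counts the graphical elements you would have to interpret:

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter


def pdf_page_to_svg(pdf_path: str, svg_path: str, page: int = 1) -> None:
    """Render a single page of a PDF to SVG with poppler's pdftocairo CLI."""
    subprocess.run(
        ["pdftocairo", "-svg", "-f", str(page), "-l", str(page), pdf_path, svg_path],
        check=True,  # raise CalledProcessError if pdftocairo fails
    )


def count_svg_elements(svg_text: str) -> Counter:
    """Count SVG elements by tag name, with the XML namespace stripped."""
    root = ET.fromstring(svg_text)
    # ElementTree tags look like "{http://www.w3.org/2000/svg}path".
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())
```

Usage would be something like `pdf_page_to_svg("in.pdf", "page1.svg")` followed by `count_svg_elements(open("page1.svg", encoding="utf-8").read())` (file names are placeholders). Which elements show up, and how text is represented, depends on the PDF and the poppler/cairo version, so treat the counts as a starting point for writing rules rather than a stable schema.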
But, as mentioned by the parent, rule-based systems work reasonably well for reasonably focused problems. You will never have a general-purpose extractor, though, since the rules need to be written by humans.
[1] https://poppler.freedesktop.org/
[2] https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...
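To give an idea of what one such hand-written rule looks like, here is a rough sketch for the "table line made of multiple smaller lines" case. It assumes you have already extracted horizontal segments as (y, x_start, x_end) tuples from the SVG; the function name and the `tol` parameter are hypothetical, and the pass is greedy (each segment is only compared with the previously merged one), so it is an illustration rather than a robust implementation:

```python
def merge_horizontal_segments(segments, tol=0.5):
    """Merge touching or overlapping horizontal segments on the same line.

    segments: iterable of (y, x_start, x_end) tuples.
    tol: hypothetical tolerance, in the same units as the coordinates.
    """
    # Normalise each segment so that x_start <= x_end, then sort by y, then x.
    ordered = sorted((y, min(a, b), max(a, b)) for y, a, b in segments)
    merged = []
    for y, x0, x1 in ordered:
        if merged:
            last_y, last_x0, last_x1 = merged[-1]
            same_line = abs(y - last_y) <= tol
            touching = x0 <= last_x1 + tol and x1 >= last_x0 - tol
            if same_line and touching:
                # Extend the previous segment instead of adding a new one.
                merged[-1] = (last_y, min(last_x0, x0), max(last_x1, x1))
                continue
        merged.append((y, x0, x1))
    return merged
```

Every document family tends to need a different tolerance, and a different set of rules like this one, which is exactly why the approach does not generalize.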
|
Rasterizing and OCR'ing a PDF is like using regex to parse XHTML. My eyes are starting to bleed out. I am done here.