by crispyambulance
924 days ago
> I have no idea why somebody hates pdfs for extracting data, when stuff like doc,xls (the old format) is clearly way worse.

Is this sarcasm? AFAIK, PDF is deliberately designed to not give a F about semantics. There is no way to determine what is part of what in a PDF document. All you've got is association by adjacency. Hasn't it always been that way? Has something changed?
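To make "association by adjacency" concrete, here is a rough sketch of what an extractor is left to do. It assumes the text has already been pulled out as hypothetical `(x, y, text)` fragments in PDF coordinates (origin at the bottom-left, so a larger `y` is higher on the page). The function name, the tolerance value, and the sample data are all made up for illustration and do not come from any particular library.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Guess text lines from positioned fragments using only proximity.

    fragments: iterable of (x, y, text) tuples (hypothetical format).
    y_tolerance: how far apart two baselines may be and still count as
    the same line (an assumed value, tune per document).
    """
    lines = []  # each entry: (baseline_y, [(x, text), ...])
    # Walk the page top to bottom (descending y), then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each guessed line, order the fragments by x position.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Fragments in the order they might be stored in the file,
# which need not match the reading order.
fragments = [
    (72.0, 700.0, "Total"),
    (300.0, 700.5, "42.00"),
    (72.0, 720.0, "Item"),
    (300.0, 720.0, "Price"),
    (250.0, 40.0, "Page 1"),
]

for line in group_into_lines(fragments):
    print(line)
```

Nothing in the input says that "Total" and "42.00" belong to the same table row, or that "Page 1" is a footer. The grouping is a guess based on coordinates alone, and it breaks down quickly with multi-column layouts or rotated text.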
|
When text is not in reading order within a PDF page, it is often headers, footers, captions, callouts, block quotes...
There are, I believe, features in the modern PDF spec for accessibility that would give you more structure to that raw text. I am not sure that this is a widely used feature when creating PDFs, though.