I find it absolutely ridiculous that we have to resort to these kinds of tools :/
We have digital formats, and we decided to standardize document distribution on the one that makes it as hard to extract data as if it were on physical paper.
PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.
Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).
Either there is very little real demand or we consistently fail at making alternatives viable.
> It also allows you to do proper copy and paste, which is much saner than anything paper offers.
Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the time when I tried to copy some text out of a PDF, the result needed significant cleanup before it was usable.
I've had similar experiences, as well as worse ones; one PDF I got from a bank about my student loans a year or so back ostensibly had only text content, but none of it could even be selected.
PDF is not very fine. Copy-paste from PDF very often results in complete rubbish, even when it is not deliberately prevented (which the format allows, and then you have to do OCR).
People purposefully disallowing copy-paste isn't a problem with PDF: in other formats they would have embedded a picture. At least with PDF you get the other advantages of proper text: infinite zoom and great compression. Sadly there are also a lot of PDFs that are little more than a picture collection that looks like text, but that's hardly the file format's fault.
It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used for creating the file (e.g. LaTeX ligatures never work in copy-paste).
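For the ligature case specifically, here is a minimal Python sketch of the usual cleanup step. It assumes the copied text came out as Unicode ligature code points (such as U+FB01 for "fi"), which is only one of the ways this goes wrong; if the PDF has no usable glyph-to-Unicode mapping you get garbage, and this won't help. The function name `expand_ligatures` is made up for illustration.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand ligature code points (e.g. U+FB01) into plain letters.

    NFKC is the Unicode compatibility normalization form. Note that it
    also rewrites other compatibility characters (non-breaking spaces,
    superscript digits, ...), so it is not a ligature-only fix.
    """
    return unicodedata.normalize("NFKC", text)


# "e" + U+FB03 (ffi ligature) + "cient", then U+FB01 (fi ligature) + "le"
pasted = "e\ufb03cient \ufb01le"
print(expand_ligatures(pasted))  # prints: efficient file
```

If you only want to touch ligatures and leave everything else alone, an explicit replacement table for the U+FB00 to U+FB06 range is the more conservative choice.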
When the PDF is a picture collection that looks like text, that's when PDF is being used correctly: something was scanned from paper and put into a paper-like format for computers, which is what PDF is.
The problem happens when people write text, data and tables on the computer and then put them into a paper-like format to share.
Have you ever actually tried to parse PDF with software? It's a sheer nightmare. PDF often gets produced from word processors that have very rich formatting information. PDF strips it all out and then you somehow have to recreate it.
That's something that should be pushed by the developer community, I think. Perhaps having an HTML client for people who nowadays use PDF writers and readers, with the option to tag data in some easily parseable format (if the data isn't already coming in a table).
This should output a single file, and ideally it should have some way of assuring the author that it won't be modified unnoticed (that's one of the features ordinary people use PDF for today; they think it is something no one can modify) -- perhaps signing it with a key from Keybase would work in the mid term.
epub is HTML-based, and its standards body recently got absorbed by the W3C. I think it would be a fine replacement for some of the uses that PDF gets (such as distributing research papers). Unfortunately I don't see it happening any time soon: PDF is so ubiquitous right now and there are very few tools that let you "save to epub". Chicken & egg.
That's Adobe. Look at their other formats, and PDF seems to be one of their better ones. Compare to SWF, PSD, AI and so on.
PDF is the successor of PostScript. PostScript is a stack-based programming language where anything can happen, while PDF enforces some document structure and metadata structure on top of it, so you can at least determine where the page breaks are without having to interpret ("run the code of") the whole document.
Still, PDF is simpler than PostScript in the same sense that XML is a simplification of SGML. Jumping from PDF to a well-designed format would be like jumping from XML to JSON or S-Expr.
It's because PDFs have no concept of lines or paragraphs. It's just characters at x,y coordinates which happen to line up. So figuring out what's a line or a column is a pain in the ass.
That's most likely why copying and pasting sucks too.
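To make the x,y point concrete, here is a minimal Python sketch of the kind of line-reconstruction heuristic an extractor ends up needing. The `Glyph` tuple, `group_into_lines`, and both thresholds are made up for illustration and don't come from any real PDF library; real extractors also have to cope with font sizes, rotation and multiple columns.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position of the glyph origin
    y: float  # baseline; PDF's default y axis grows upward


def group_into_lines(glyphs: List[Glyph],
                     y_tolerance: float = 2.0,
                     space_gap: float = 4.0) -> List[str]:
    """Guess text lines from loose glyph positions.

    Glyphs whose baselines are within y_tolerance of the first glyph of
    a row are treated as one line; a horizontal jump larger than
    space_gap between neighbours is treated as a word space.
    """
    rows = []  # list of (baseline_y, [glyphs]) from top to bottom
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if rows and abs(rows[-1][0] - g.y) <= y_tolerance:
            rows[-1][1].append(g)
        else:
            rows.append((g.y, [g]))

    lines = []
    for _, row in rows:
        row.sort(key=lambda g: g.x)  # left to right within the line
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            if cur.x - prev.x > space_gap:
                text += " "
            text += cur.char
        lines.append(text)
    return lines


# Glyphs arrive in arbitrary order, as they may in a content stream.
sample = [
    Glyph("o", 13, 100), Glyph("k", 3, 88), Glyph("H", 0, 100),
    Glyph("t", 10, 100), Glyph("o", 0, 88), Glyph("i", 3, 100),
]
print(group_into_lines(sample))  # prints: ['Hi to', 'ok']
```

Both thresholds are guesses that depend on the font size, which is exactly why this breaks so easily on real documents.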
I had to use Tabula to extract a decade of SAT scores from PDFs for each state/year. It was a nightmare, but I managed it. More recently, I was hoping to do something similar with decennial census data, but it was just too much. Far, far too many groups publish data as PDF, which is about as bad as if they had just deleted it outright. It's very upsetting.
PDF is fucked up beyond all doubt. But there seems to be no better (even if unpopular) alternative.
How do you imagine a better alternative to PDF? On the one hand, we have text-based formats. They are not a serialization of the exact rendering. On the other hand, we have PostScript, which is probably too complex to be manipulated as text once rendered. PDF and DjVu kinda do both, even if quite imperfectly.
So how do we construct a file format which can render a symbol (not necessarily a Unicode one) anywhere, pixel-perfect, but still has a concept of words, paragraphs, and preferably tables and such?
epub is the way to go I think. PDF is an overengineered abomination. It nicely serves the purpose of "there is only one and exactly one way to render this", but then again, just about so does an image.
I also think that, with more love, epub could get there too. It's not an easy problem, but if we can crack SHA-1 I'm sure we can crack this one too :)
It's not as versatile (no forms, for example), but layout- and prepress-wise it seems to be as good as PDF (with the benefit that it retains the structure).
Tabula is a great tool. In my experience it's the most reliable open source software for extracting tables from PDFs. We are using their underlying Tabula-Java library for some parts of https://docparser.com and are happily sponsoring their project.
I didn't know about Tabula, so I gave it a try right away. Apparently it only extracts tables and ignores everything around them. This might be good in some cases, but it is a problem if you want to extract a form, a whole textbook, your bank statements or anything else. Also, I noticed that Tabula has some slight trouble when the column lines are not drawn in the table. But overall it is a good tool for extracting only tables, that's true.
Tabula is a nice free tool, but it requires some technical background to run. There is also the free https://pdf.co, with both online and offline tools (Windows) for PDF to CSV. (disclaimer: I work on it)
Hi there. We try to make a tool that's as simple to use as possible (given the constraints of a volunteer-run project such as Tabula). What technical background do you think is required to use it? (disclaimer: I'm the main author of Tabula)
Hi, and thank you for your work on Tabula! Well, some months ago I advised someone to try Tabula, and the first thing that happened was that the Java download page opened without any explanation. She managed to install the Java runtime and tried again, but when she tried to upload files it displayed either an "internal server error" message from JRuby or just plain JSON in the browser.
So, in my opinion and experience, it may require some effort to run it (at least the first time). But to _use_ it, for sure, no such technical background is required.