| HN Mirror

	by nemild 3401 days ago
	For those interested in converting PDF tables into CSV, there's also Tabula ( http://tabula.technology/ ) (Used by many journalists to analyze the data in PDFs)

4 comments

scrollaway 3401 days ago

I find it absolutely ridiculous that we have to resort to these kinds of tools :/

We have digital formats, and we decided to standardize document distribution on the one that makes it as hard to extract data as if it were on physical paper.

halomru 3401 days ago

PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making alternatives viable.

TeMPOraL 3401 days ago

> It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Technically it does that, yes. Rarely do I see people taking advantage of it, though; most of the time when I tried to copy some text out of a PDF, the result had to undergo significant cleanup before becoming usable.

saghm 3400 days ago

I've had similar experiences, as well as worse; one PDF I got from a bank about my student loans a year or so back had ostensibly only text content, but none of it was even able to be selected.

ptaipale 3401 days ago

PDF is not very fine. Copy-paste from PDF very often results in complete rubbish, even when it is not deliberately prevented (which the format allows, and then you have to do OCR).

halomru 3401 days ago

People purposefully disallowing copy-paste isn't a problem with PDF: in other formats they would have embedded a picture, at least with PDF you get the other advantages of proper text: infinite zoom and great compression. Sadly there's also a lot of PDFs that are little more than a picture collection that looks like text, but that's hardly the file format's fault.

It really is a problem with PDF that it's too easy to get a file where copy and paste yields a different result than what's displayed. But this varies widely with the software used for creating the file (e.g. LaTeX ligatures never work in copy-paste).
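
A minimal sketch of one cleanup step for the ligature case, assuming the PDF maps the ligature glyph to its Unicode code point (for example U+FB01 for "fi") rather than to nothing at all. The function name is made up for illustration. NFKC normalization also rewrites other compatibility characters (superscripts, full-width forms), so it is not a lossless fix.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the 'fi' ligature (U+FB01)."""
    return unicodedata.normalize("NFKC", text)


# Text as it might come out of a copy-paste, with fi, ffi and fl ligatures.
pasted = "The \ufb01rst e\ufb03cient work\ufb02ow"
print(expand_ligatures(pasted))  # The first efficient workflow
```

If the PDF has no usable glyph-to-Unicode mapping for the ligature, the character is lost before this step and normalization cannot bring it back.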

fiatjaf 3401 days ago

When the PDF is a picture collection that looks like text, that's when PDF is being used correctly, because that's when something was scanned from paper and put into a paper-like format for computers: PDF.

When people write text and data and tables on the computer then put it on a paper-like format to share that's when the problem happens.

maxxxxx 3400 days ago

Have you ever actually tried to parse PDF with software? It's a sheer nightmare. PDF often gets produced from text processors that have very rich format information. PDF strips it all out and then you somehow have to recreate it.

maxxxxx 3400 days ago

PDF has no paragraphs, often not even words. No concept of footnotes. It doesn't reflow well on different screen sizes.

If you designed a real bad format on purpose, it would be hard to top PDF. Maybe Photoshop files are worse.

brianwawok 3399 days ago

Have you seen the spec for .doc and .xls ?

maxxxxx 3399 days ago

I don't even want to know :)

grkvlt 3398 days ago

Oh, you really do! The format for COM-object-based documents like XLS and DOC is actually a FAT-like filesystem packed into a single file: https://en.wikipedia.org/wiki/Compound_File_Binary_Format
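
A small sketch of how such a file can be recognized: Compound File Binary documents start with a fixed 8-byte signature. The function name is hypothetical, and this only checks the signature; it does not walk the sector allocation table.

```python
# Signature found at offset 0 of Compound File Binary files (legacy .doc, .xls).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_cfb(data: bytes) -> bool:
    """Return True if the data begins with the CFB signature."""
    return data[:8] == CFB_MAGIC


print(looks_like_cfb(CFB_MAGIC + b"\x00" * 504))  # True
print(looks_like_cfb(b"%PDF-1.4\n"))              # False
```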

fiatjaf 3401 days ago

HTML is a viable alternative. And it is something everyone can parse easily, better yet if the data is tagged with classes somehow.
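
To illustrate the "tagged with classes" point, here is a rough sketch using only the Python standard library. The class names (`state`, `score`) and the numbers in the sample table are made up, and the parser assumes simple, non-nested table cells.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> that carries a given class name."""

    def __init__(self, wanted_class: str):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.wanted_class in classes:
            self.capturing = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.values[-1] += data


# Made-up sample document; the values are placeholders, not real data.
doc = (
    '<table>'
    '<tr><td class="state">Ohio</td><td class="score">1052</td></tr>'
    '<tr><td class="state">Utah</td><td class="score">1110</td></tr>'
    '</table>'
)
parser = CellExtractor("score")
parser.feed(doc)
print(parser.values)  # ['1052', '1110']
```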

fiatjaf 3401 days ago

That's something that should be pushed by the developer community, I think. Perhaps having an HTML client for people who nowadays use PDF writers and readers, with the option to tag data in some easily parseable format (if the data isn't already coming in a table).

This should output a single file, and ideally it should give the author some assurance that it won't be modified unnoticed (that's one of the features ordinary people use PDF for today; they think it is something no one can modify). Perhaps signing it with a key from Keybase would work in the mid term.

https://github.com/iffy/lhtml has something going in this direction.
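
The "won't be modified unnoticed" part can be sketched with the standard library. Note the swap: this uses an HMAC with a shared secret as a stand-in, not a public-key signature from Keybase or anything lhtml actually does, and the function names are invented for the example.

```python
import hashlib
import hmac


def make_tag(document: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the document bytes."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()


def is_unmodified(document: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time; any change to the bytes fails."""
    return hmac.compare_digest(make_tag(document, key), tag)


key = b"example-shared-secret"  # placeholder key for illustration only
original = b"<html><body><table>...</table></body></html>"
tag = make_tag(original, key)

print(is_unmodified(original, key, tag))               # True
print(is_unmodified(original + b"<!-- x -->", key, tag))  # False
```

A real deployment would want an asymmetric signature so readers can verify without being able to forge; that needs a third-party library and is left out here.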

scrollaway 3401 days ago

epub is html-based, and their standards body recently got absorbed by the W3C. I think it would be a fine replacement for some of the uses that PDF gets (such as distributing research papers). Unfortunately I don't see it happening any time soon: PDF is so ubiquitous right now and there are very few tools that let you "save to epub". Chicken & egg.
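
For what "html-based" means in practice: an EPUB is a ZIP archive with a `mimetype` entry and (X)HTML content files inside. A rough sketch, with a made-up in-memory archive and a hypothetical function name; a real reader would follow the package manifest rather than guess by file extension.

```python
import io
import zipfile


def epub_html_members(data: bytes) -> list:
    """List the (X)HTML entries of an EPUB given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Raises KeyError if the archive has no "mimetype" entry at all.
        if archive.read("mimetype").strip() != b"application/epub+zip":
            raise ValueError("not an EPUB container")
        return [n for n in archive.namelist() if n.endswith((".xhtml", ".html"))]


# Build a tiny stand-in archive in memory (not a complete, valid EPUB).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
    z.writestr("OEBPS/style.css", "p { margin: 0 }")

print(epub_html_members(buf.getvalue()))  # ['OEBPS/chapter1.xhtml']
```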

vog 3401 days ago

That's Adobe. Look at their other formats, and PDF seems to be one of their better ones. Compare to SWF, PSD, AI and so on.

PDF is the successor of PostScript. PostScript is a stack-based programming language where anything can happen, while PDF enforces some document structure and metadata structure on top of it, so you can e.g. at least determine where pagebreaks are, without having to interpret ("run the code of") the whole document.

Still, PDF is simpler than PostScript in the same sense that XML is a simplification of SGML. Jumping from PDF to a well-designed format would be like jumping from XML to JSON or S-Expr.
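
A toy illustration of the "stack-based, anything can happen" point. This is not a PostScript interpreter; it only borrows a few operator names (`add`, `sub`, `mul`, `dup`, `exch`) to show that a value such as a text position may only be known after running the program.

```python
def run(program: str) -> list:
    """Evaluate a whitespace-separated stack program and return the stack."""
    stack = []
    binary_ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
    }
    for token in program.split():
        if token in binary_ops:
            b = stack.pop()
            a = stack.pop()
            stack.append(binary_ops[token](a, b))
        elif token == "dup":
            stack.append(stack[-1])
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:
            stack.append(float(token))  # anything else is treated as a number
    return stack


# A coordinate that is computed rather than written down literally.
print(run("72 10 mul 36 add"))  # [756.0]
```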

simooooo 3400 days ago

It's because PDFs have no concept of lines or paragraphs. It's just characters at an x,y co-ord which happen to line up. So figuring out what's a line or a column is a pain in the ass.

That's most likely why copying and pasting sucks too.
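
A minimal sketch of what "figuring out what's a line" involves, assuming the glyphs have already been extracted as `(x, y, char)` tuples with y increasing upward, as in PDF page coordinates. The function name and the tolerance value are made up for illustration.

```python
def group_into_lines(glyphs, y_tolerance=2.0):
    """Cluster (x, y, char) glyphs into text lines, top of page first."""
    lines = []  # each entry: [reference_y, [(x, char), ...]]
    for x, y, char in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    # Within a line, order glyphs left to right before joining.
    return ["".join(c for _, c in sorted(items)) for _, items in lines]


# Glyphs in arbitrary order, with a slightly uneven baseline on the top line.
glyphs = [(16, 686, "o"), (10, 700, "H"), (10, 686, "y"), (16, 700.4, "i")]
print(group_into_lines(glyphs))  # ['Hi', 'yo']
```

Note that nothing here inserts spaces: word boundaries would have to be guessed separately from the horizontal gaps between glyphs, which is the "often not even words" problem.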

wslh 3400 days ago

Yes, and even more so when you want to send a PDF-based document to a Kindle.

acbart 3401 days ago

I had to use Tabula to extract a decade of SAT scores from PDFs for each state/year. It was a nightmare, but I managed it. More recently, I was hoping to do something similar with decennial census data, but it was just too much. Far, far too many groups publish data to PDF, which is about as bad as if they just deleted it straight-out. It's very upsetting.

krick 3400 days ago

PDF is fucked up beyond all doubt. But there seems to be no better (even if unpopular) alternative.

How do you imagine a better alternative to PDF? On the one hand, we have text-based formats. They are not a serialization of the exact rendering. On the other hand, we have PS, which is probably too complex to be manipulated as text once rendered. PDF and DjVu kind of do both, even if quite imperfectly.

So how do we construct a file format which can render a symbol (not necessarily a Unicode one) anywhere, pixel-perfect, but still has a concept of words, paragraphs, and preferably tables and such?

scrollaway 3400 days ago

epub is the way to go I think. PDF is an overengineered abomination. It nicely serves the purpose of "there is only one and exactly one way to render this", but then again, just about so does an image.

I also think that, with more love, epub could get there too. It's not an easy problem, but if we can crack SHA-1 I'm sure we can crack this one too :)

krick 3399 days ago

I don't see how epub can be pixel-perfect. It's almost as much a markup format as fb2. Clearly more explanation of how it should be done is in order.

scrollaway 3399 days ago

Pixel perfection is not necessary for 99.999% of the cases PDF is used in.

krick 3394 days ago

That's just a ridiculous statement.

Mikhail_Edoshin 3400 days ago

Microsoft XPS?

krick 3399 days ago

This is interesting. I never considered this one. How is it inferior to PDF, such that it is so much less widespread?

Mikhail_Edoshin 3399 days ago

It's not as versatile (no forms, for example), but layout- and prepress-wise it seems to be as good as PDF (with the benefit that it retains the structure).

krakaukiosk 3401 days ago

Tabula is a great tool. In my experience it's the most reliable open source software for extracting tables from PDFs. We are using their underlying Tabula-Java library for some parts of https://docparser.com and are happily sponsoring their project.

jlink 3401 days ago

I didn't know about Tabula and gave it a try right away. Apparently it only extracts tables and ignores everything around them. That might be good in some cases, but it is a problem if you want to extract a form, a whole textbook, your bank statements, or anything else. Also, I noticed that Tabula has some slight trouble when column borders are not drawn in the table. But overall it is a good tool for extracting only tables, that's true.
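
When column borders are not drawn, one generic heuristic is to look for horizontal gaps that no text fragment crosses. The sketch below is an assumption about how such a guess could work, not Tabula's actual algorithm; the function name, the input shape, and the `min_gap` default are invented for the example.

```python
def column_boundaries(spans, min_gap=5.0):
    """Guess column separators from (x_start, x_end) text fragment extents.

    Returns the midpoints of horizontal gaps, at least min_gap wide,
    that no fragment in the table region overlaps.
    """
    boundaries = []
    covered_until = None
    for start, end in sorted(spans):
        if covered_until is not None and start - covered_until >= min_gap:
            boundaries.append((covered_until + start) / 2)
        covered_until = end if covered_until is None else max(covered_until, end)
    return boundaries


# Fragment extents from several rows of a made-up three-column table.
spans = [(10, 40), (82, 110), (12, 35), (150, 170), (80, 100)]
print(column_boundaries(spans))  # [60.0, 130.0]
```

A single cell whose text spills across a gap is enough to hide a boundary, which may be one reason borderless tables are harder to extract.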

eumm 3401 days ago

Tabula is a nice free tool but requires some technical background to run. There is also the free https://pdf.co, with both online and offline (Windows) tools for PDF to CSV. (Disclaimer: I work on it.)

jazzido 3400 days ago

Hi there. We try to make a tool that's as simple to use as possible (given the constraints of a volunteer-run project such as Tabula). What technical background do you think is required to use it? (disclaimer: I'm the main author of Tabula)

eumm 3399 days ago

Hi, and thank you for your work on Tabula! Well, some months ago I advised someone to try Tabula, and the first thing that happened was that the Java download page opened without any explanation. She managed to install the Java runtime and try again, but when she tried to upload files it displayed either an internal server error in a JRuby message or just plain JSON in the browser. So, in my opinion and experience, it may take some effort to run it (at least the first time). But to _use_ it, for sure, no such technical background is required.
