

by danielrhodes 1554 days ago

This really does not resonate at all, and I have the scars to prove it.

I used to work on a browser-based document management system, and I would have used (or at least tried) all of these APIs without hesitation. PDFs are a pain, and the mishmash of poorly functioning tools that exists is a constant headache.

1) OCR'ing of a PDF is difficult. The only good service is Google, but it requires that you break the PDF into page images to be performant. This would have simplified things greatly. Even if the PDF has text inside and is not an image, that text can be wrong or not laid out in a linear way, so you have to OCR it anyway. Command line tools do not get you very far. An example: if you OCR or text extract a PDF with multiple columns of text, does it handle the columns well?

2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath. This requires a technique where you overlay transparent text in the exact position of text in the bitmap. This does not come for free and I've only seen this done in proprietary Windows-only software. This alone would be worth it.

3) Office to PDF is an extremely standard need, especially if you want to display them online. But it's not easy. You have to hack together a headless OpenOffice to have it work at all, but it doesn't do a great job. It's difficult to do well because Office docs are like HTML pages in that it greatly depends on the renderer, not to mention the fonts. Microsoft does not offer a service to do this, unfortunately. If you think anything will do, it really won't: when people see their PDF looks very different than what they saw on Word, they get upset.

4) Table extraction APIs are super important, especially if you are trying to automatically extract data from PDFs (e.g. analyze financial disclosures). There have been whole startups dedicated to this.

5) HTML to PDF is also a pain: you have to set up an instance that is running headless Chromium, which can be quite slow. This has become the de facto standard way to quickly create complex PDFs. Having a simple API wrapper around this is just one less thing to manage.

The rest of the APIs, like the merging/splitting/watermarking etc., are pretty standard and you do not need APIs if you already have access to the PDF on a server. But if you were in a browser or on mobile, you might not.
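To make the multi-column question in point 1 concrete, here is a minimal Python sketch of one way to recover reading order from OCR word boxes: sweep left to right, split the words into columns wherever a wide enough vertical gutter appears, then read each column top to bottom. The `Word` type, the `reading_order` function, and the `min_gutter` threshold are all hypothetical names chosen for illustration; real OCR output has jittery baselines, headings that span columns, and tables, none of which this handles.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float   # left edge of the word box
    x1: float   # right edge of the word box
    top: float  # distance from the top of the page


def reading_order(words, min_gutter=20.0):
    """Return word texts column by column instead of straight across the page.

    Assumption: columns are separated by an empty vertical gutter at least
    `min_gutter` units wide, and words on one line share the same `top`.
    """
    columns = []  # each entry is [right_edge_so_far, [words in this column]]
    for w in sorted(words, key=lambda w: w.x0):
        if columns and w.x0 - columns[-1][0] < min_gutter:
            # Word starts inside (or close to) the current column band.
            columns[-1][0] = max(columns[-1][0], w.x1)
            columns[-1][1].append(w)
        else:
            # A gap wider than the gutter threshold starts a new column.
            columns.append([w.x1, [w]])

    ordered = []
    for _, col in columns:
        # Inside a column, read top to bottom, then left to right.
        ordered.extend(sorted(col, key=lambda w: (w.top, w.x0)))
    return [w.text for w in ordered]
```

A naive sort by `(top, x0)` over the whole page would interleave the lines of the left and right columns, which is the failure mode the comment is describing.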

7 comments

ankrgyl 1554 days ago

I'll just throw my hat in the ring and mention that at Impira, we are one of those startups wholly dedicated to (4). We happen to use Google's OCR engine (1) under the hood (for raw OCR), and what you said resonates for sure: there's a lot of engineering work required to make it work performantly and generally (happy to chat about this with anyone who is interested).

Feel free to take Impira for a spin (https://www.impira.com) if you need to accurately extract data from PDF documents. Would love feedback from anyone who tries it out. [Disclaimer: I am the CEO/Founder of Impira].


jfk13 1554 days ago

I agree many of these things are a pain. This often reflects a workflow that is approaching things from entirely the wrong direction. ("If I wanted to go there, I wouldn't start from here.")

E.g. instead of trying to OCR a PDF, go back to the source document or database or whatever from which the PDF was generated. (Yes, I know that's not always an option. But it should be the first avenue to explore. We should push back against people who send around PDFs as though they were an all-purpose interchange format for textual or structured data.)

I'm a bit puzzled by (3), though:

> Office to PDF ... it's not easy ... when people see their PDF looks very different than what they saw on Word, they get upset

To get a PDF that looks the same as the Word document, just tell them to use the Print to PDF driver from right there within Word.


ankrgyl 1554 days ago

I think you recognize this already, but to add a bit of color, in highly regulated industries (e.g. financial services) and B2B settings with lots of peers (e.g. supply chain), "going back to the source document or database or whatever" requires an insane amount of consensus (which is not currently incentivized).

To add to that, a lot of PDFs (e.g. financial reports) are generated procedurally with ancient code that would have to be rewritten to generate a different format. The underlying database format is often many layers of abstraction different than the final output.


pipeline_peak 1554 days ago

> Office to PDF is an extremely standard need

Is it really an extremely standard need, or just something that appears in the bs corners of our jobs a few times a year?


danielrhodes 1554 days ago

Yes, if you're working with documents a lot it is. Word docs are not portable and people don't like them because they can be changed easily, not to mention not everybody has Word. You also can't display them inline in a browser.


yyyk 1553 days ago

>HTML to PDF is also a pain: you have to set up an instance that is running headless Chromium, which can be quite slow...

There are at least 6 non-Chromium alternatives that I can think of at a moment's notice, and also LGPL wkhtmltopdf.

>Office to PDF.... You have to hack together a headless OpenOffice to have it work at all, but it doesn't do a great job... Microsoft does not offer a service to do this, unfortunately.

Microsoft sorta does offer a service to do this. SharePoint has a Word-to-PDF action, and with some stitching you can make it into an API. There are also several commercial solutions (e.g. Spire.NET) for this, and there are ways to mangle the OpenXML into HTML (of course losing some fidelity in the process).
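For reference, the two self-hosted routes being debated here (headless Chromium for HTML, headless OpenOffice/LibreOffice for Office files) both come down to a single command-line invocation. The Python sketch below only builds and runs those commands. The function names are hypothetical, and the binary names `chromium` and `soffice` are assumptions that vary by platform and install (for example `chromium-browser` or `google-chrome`). It says nothing about rendering fidelity or speed, which is what the thread is actually arguing about.

```python
import shutil
import subprocess
from pathlib import Path


def chromium_pdf_cmd(html_path, pdf_path, binary="chromium"):
    """Command line for printing a local HTML file to PDF with headless Chromium.

    `binary` is an assumption; adjust it to whatever your system calls it.
    """
    return [
        binary,
        "--headless",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),  # file:// URL of the input page
    ]


def soffice_pdf_cmd(doc_path, out_dir, binary="soffice"):
    """Command line for converting an Office document to PDF with LibreOffice."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def run(cmd, timeout=120):
    """Run a conversion command, failing early if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True, timeout=timeout)
```

Usage would look like `run(soffice_pdf_cmd("report.docx", "out"))`, where the file names are placeholders. Keeping command construction separate from execution makes it easy to put either converter behind a small internal API with a timeout.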


amluto 1552 days ago

All of the above may be correct, but nothing here advocates for a web service instead of licensed software. If I want to solve a linear program, I can use an open source library or I can pay for a commercial offering, but that commercial offering will run on my hardware (or cloud instance) and will operate independently of the network. If I want to edit a Word document, I can pay Microsoft for a local copy of Word.


jcuenod 1554 days ago

I'm a very happy user of OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/


eastendguy 1554 days ago

> 1) OCR'ing of a PDF is difficult. The only good service is Google

OCRspace is OK, too, and easier to use. You can just send the PDF. It is free for PDFs with 3 or fewer pages.

> 2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath.

OCRspace can also create searchable PDFs: https://ocr.space/searchablepdf
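As an illustration of the overlay technique quoted in point 2 (not a description of how OCRspace or any other product implements it), the Python sketch below turns one OCR word box into PDF content-stream operators that draw the word invisibly on top of the scanned image. It relies on two facts from the PDF format: text rendering mode 3 (`3 Tr`) draws text with neither fill nor stroke, and default user space is 72 points per inch with the origin at the bottom-left. Everything else is an assumption: the function names are hypothetical, the page is assumed to be exactly the scanned image at `dpi`, the font resource `/F1` is assumed to exist on the page, and the font size is approximated from the box height without fitting the word's width.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, box_px, image_height_px, dpi, font_name="/F1"):
    """Content-stream operators placing `text` invisibly over an OCR word box.

    box_px is (left, top, right, bottom) in image pixels, origin at top-left.
    Assumes simple Latin text; other scripts need proper font encoding.
    """
    left, top, right, bottom = box_px
    scale = 72.0 / dpi  # pixels -> PDF points
    x = left * scale
    # Flip the vertical axis: PDF measures y upward from the page bottom.
    y = (image_height_px - bottom) * scale
    font_size = (bottom - top) * scale  # rough: box height as font size
    return "\n".join([
        "BT",                               # begin text object
        "3 Tr",                             # rendering mode 3: invisible
        f"{font_name} {font_size:.2f} Tf",  # select font and size
        f"{x:.2f} {y:.2f} Td",              # move to the word's lower-left corner
        f"({escape_pdf_string(text)}) Tj",  # show the text
        "ET",                               # end text object
    ])
```

A production text layer also has to scale each word horizontally so selections line up with the glyphs in the bitmap, which needs font metrics and is left out here.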
