by amluto
1556 days ago
I find this utterly bizarre. Once upon a time, if you wanted to left pad a string, you would just do it. A while later, people discovered that you could use a library. (I’m joking a bit here, but libraries are genuinely useful.) With a library, you get to pick from various schemes and schedules for updating the library, but you have a degree of control. But now, apparently, you’re supposed to use a web API and depend on an external service. This has all kinds of downsides: It has latency (and potentially tail latency). It has larger security issues. It doesn’t work in many sandboxes. It requires an asynchronous call. Callers have to handle timeouts and retries. (If you left pad a string with a normal library, it either works or it doesn’t. With a web service, it can fail transiently or give wrong answers transiently.) It updates on its own schedule, without notice, and cannot be rolled back. And it can charge an utterly outrageous per-call price, so instead of merely profiling and debugging slowness due to making too many calls, developers also have to worry about inadvertently spending hundreds of thousands of dollars. Replace “left pad a string” with “generate a PDF” and you get this. Why is this desirable? I suppose things like this may partially explain the stunning slowness of bank websites.
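To make the contrast concrete, here is a rough Python sketch rather than anyone's real API. `left_pad` is the local version. `make_flaky_left_pad`, `TransientError`, and `call_with_retries` are hypothetical names invented for illustration, standing in for a remote service and the retry handling its callers end up writing.

```python
import time


def left_pad(text: str, width: int, fill: str = " ") -> str:
    """Local version: a pure function that either works or raises."""
    return text.rjust(width, fill)


class TransientError(Exception):
    """Stand-in (assumption) for a timeout or 5xx from a remote service."""


def call_with_retries(remote_call, attempts: int = 3, backoff_s: float = 0.0):
    """Retry a call that may fail transiently, with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return remote_call()
        except TransientError as err:
            last_error = err
            time.sleep(backoff_s * (2 ** attempt))
    raise last_error


def make_flaky_left_pad(failures: int):
    """Hypothetical remote left-pad that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def remote_left_pad(text: str, width: int, fill: str = " ") -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError("simulated timeout")
        return left_pad(text, width, fill)

    return remote_left_pad
```

The local call is one line at the call site. The remote one needs a retry budget, a backoff policy, and a decision about what to do when the budget runs out, and that is before latency, billing, or silently changed behavior enter the picture.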
I used to work on a browser-based document management system, and I would have used (or at least tried) all of these APIs without hesitation. PDFs are a pain, and the mishmash of poorly functioning tools that exist is a constant headache.
1) OCR'ing a PDF is difficult. The only good service is Google's, but it requires that you break the PDF into pages as images for it to perform well. This would have simplified things greatly. Even if the PDF has text inside and is not an image, that text can be wrong or not laid out in a linear way, so you have to OCR it anyway. Command line tools do not get you very far. An example: if you OCR or text extract a PDF with multiple columns of text, does it handle the columns well?
2) People want searchable OCR'd PDFs where you can highlight the text, even when it's a bitmap underneath. This requires a technique where you overlay transparent text in the exact position of the text in the bitmap. This does not come for free, and I've only seen it done in proprietary Windows-only software. This alone would be worth it.
3) Office to PDF is an extremely standard need, especially if you want to display the documents online. But it's not easy. You have to hack together a headless OpenOffice to have it work at all, and it doesn't do a great job. It's difficult to do well because Office docs are like HTML pages in that the result greatly depends on the renderer, not to mention the fonts. Microsoft does not offer a service to do this, unfortunately. If you think anything will do, it really won't: when people see that their PDF looks very different from what they saw in Word, they get upset.
4) Table extraction APIs are super important, especially if you are trying to automatically extract data from PDFs (e.g. analyze financial disclosures). There have been whole startups dedicated to this.
5) HTML to PDF is also a pain: you have to set up an instance running headless Chromium, which can be quite slow. This has become the de facto standard for quickly creating complex PDFs. Having a simple API wrapper around this is just one less thing to manage.
The rest of the APIs, like the merging/splitting/watermarking etc., are pretty standard and you do not need APIs if you already have access to the PDF on a server. But if you were in a browser or on mobile, you might not.
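For anyone who has not done the self-hosted version of points 3 and 5, it looks roughly like the Python sketch below. The binary names `soffice` and `chromium` are assumptions that vary by platform and install, and the helper names are made up for illustration. The flags are the commonly documented ones for LibreOffice/OpenOffice and headless Chromium.

```python
import subprocess
from pathlib import Path


def office_to_pdf_command(doc_path, out_dir, binary="soffice"):
    """Build a headless LibreOffice/OpenOffice conversion command.

    `binary` is an assumption; the executable name differs between installs.
    """
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def html_to_pdf_command(html_path, pdf_path, binary="chromium"):
    """Build a headless Chromium print-to-PDF command.

    `binary` is an assumption; it may be `chromium-browser`, `google-chrome`, etc.
    """
    # Pass the page as a file:// URL so relative paths are not ambiguous.
    page_url = Path(html_path).resolve().as_uri()
    return [
        binary,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        page_url,
    ]


def run_conversion(command, timeout_s=60):
    """Run a conversion and raise on a non-zero exit or a hang past the timeout."""
    subprocess.run(command, check=True, timeout=timeout_s)
```

Something like `run_conversion(html_to_pdf_command("invoice.html", "invoice.pdf"))` is the whole happy path. The part that becomes a chore is everything around it: installing fonts, keeping the browser or office suite patched, and killing processes that hang.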