Hacker News
by gyulai 1559 days ago
Not a good use case for an online API. To the extent that those PDFs could include sensitive information, there's a huge security/privacy headache there, with no real benefit when compared to performing these functions offline. It also seems to me a lot more expensive than alternative ways of doing the same thing.
6 comments

Very valid concern around privacy. We don't store the documents (see https://pspdfkit.com/api/privacy/), but for people that have sensitive documents to process, we offer an on-prem product, see https://pspdfkit.com/api/documentation/deployment-options/. You can run it in your own infra and it doesn't report any telemetry to us, so information remains completely private.
> with no real benefit when compared to performing these functions offline

Most organizations these days are developing in the cloud. I assume you don't mean "offline" but rather performing these functions yourself.

I've tried, but it's a huge pain in the butt. PDFs are very quirky. Things work well 95% of the time and the 5% takes a lot of time to figure out.

When trying to do this myself in an app deployed to AWS, I've had many issues with getting all characters in different languages to work. Every few months, some new thing in a PDF file throws an error and the file won't generate. You get weird file size errors. And the quality of PDF generation varies a lot by language. I'd much rather have an API that I can just call from any of my code if it JUST WORKS.

Now, their pricing is strange and might be a dealbreaker for me. I'd like to see an option to pay per transaction with no cap without having to negotiate with their sales team.

> there's a huge security/privacy headache there

Eh, maybe. Some use cases don't require privacy. In my case, I'm mostly assembling PDFs from various sources with my company's documents. No, I don't want a vendor that is going to post my documents to Twitter, but I can sleep at night if I have some kind of assurance that they don't use or sell my data.

By negating the cloud model, I'm not saying I want to do stuff "myself" at all. I just want the delegation to work differently.

By "offline", I mean "offline" w.r.t. their servers. The point is: I have some kind of an environment that I'm already using for handling the data underneath the PDFs I'm trying to produce. That environment has a given amount of attack surface area. If they hand me a program that I can run in my environment and that doesn't communicate outside that environment, I have gained additional functionality. This is additional stuff I can do that I couldn't do before, and I've achieved this without increasing the attack surface area that might lead to that data getting compromised.

If I do it the other way around, i.e. instead of them handing me a program, I am handing them my data, then the attack surface area increases. Because every attack against my pre-existing environment continues to be an attack that will compromise my data. But also: Some attacks against their environment would now also end up compromising my data. So it's worse for security.

If I have a car in a garage, and I give the key to the garage to 3 people, it's going to be more secure than if I give the key to 4 people. Because there's one less person who might lose the key and enable a robber to get in.
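To put rough numbers on the garage analogy, here is a minimal sketch. It assumes each key holder loses the key independently and with the same probability; both the independence and the value of `p_loss` are illustrative assumptions, not measurements.

```python
def breach_probability(holders: int, p_loss: float) -> float:
    """Chance that at least one of `holders` independent key holders loses the key.

    Assumes every holder has the same, independent loss probability `p_loss`.
    """
    if holders < 0:
        raise ValueError("holders must be non-negative")
    if not 0.0 <= p_loss <= 1.0:
        raise ValueError("p_loss must be between 0 and 1")
    # Probability that nobody loses the key is (1 - p)^n; the breach is the complement.
    return 1.0 - (1.0 - p_loss) ** holders


if __name__ == "__main__":
    p = 0.05  # hypothetical per-holder loss probability
    for n in (3, 4):
        print(f"{n} holders: {breach_probability(n, p):.4f}")
```

With the hypothetical `p_loss = 0.05`, three holders give roughly 14.3% and four give roughly 18.5%. Under these assumptions, every extra party that holds the data adds risk and none removes any.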

I don't understand the logic behind the converse idea at all. The idea seems to be: Azure/AWS/whatever is "the cloud". Therefore my data is already in "the cloud". Random company X is also in "the cloud". So I might as well send my data to company X. -- This sounds to me like the What-The-Hell Effect. Like: I've broken my diet because I ate a hamburger. Now I might as well quit the diet. No: Eating fewer hamburgers tomorrow is still better than eating more.

I also don't understand why things have to be architected that way in order to "just work". Weasyprint just works. Pandoc just works. LaTeX just works. I can put them on a computer with no network connection, and they'll happily do their job for me. They give me a lot of functionality and ask very little trust in return. That's a good thing. Whenever that's an option, that's what I'm going to do.
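As a rough illustration of the "hand me a program" model, the sketch below shells out to a locally installed Pandoc and never touches the network. The function names and file names are hypothetical, and it assumes `pandoc` plus a PDF engine (a LaTeX distribution by default) are already installed on the machine.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(source: Path, target: Path) -> list[str]:
    """Build the argument list for a plain `pandoc SOURCE -o TARGET` call."""
    return ["pandoc", str(source), "-o", str(target)]


def render_offline(source: Path, target: Path) -> None:
    """Render `source` to `target` using only the local pandoc binary."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed on this machine")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(source, target), check=True)


if __name__ == "__main__":
    render_offline(Path("book.md"), Path("book.pdf"))  # hypothetical file names
```

The data stays inside the environment that already holds it; the only thing that crosses the trust boundary is the tool itself, at install time.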

That's a great point. For folks that have strong privacy needs, we do have an on-premise product that provides the same functionality [1].

[1] https://pspdfkit.com/server/processor/

So what exactly does that leave? A wrapper that you've created around weasyprint, pandoc, latex, ghostscript, imagemagick, and stuff like that?

Sounds to me like an unnecessary extra expense for an unnecessary extra layer of abstraction. And there's a risk factor that comes with it: Say I make a nontrivial investment, like write a book that I'm planning on typesetting with this, or write a reporting infrastructure that creates automated reports or something. I'll make a huge up-front investment there that is tied to your API. Then I want to run this, while not touching it, for 10 years so it can earn a return on investment.

Then I come back to it 10 years later, because I'm writing the second edition of the book, or I want to change something about my reporting infrastructure. Has your company gone out of business in the meantime? Have you deprecated the product? Do you still support the API from 10 years ago? Does it still produce the same output for the same input? ...or do I need to take a huge write-off on all the work I've done on typesetting my book or hooking up my reporting infrastructure?

In the open source world, I'd just make sure to bundle all the tools I'm using, including their source code, in a Docker container or something. In the "10 years later" scenario, I'll probably need to touch only the book's source code, or the reporting infrastructure's source code, not the typesetting infrastructure. And if there's something I really really need, then I can go to the source and change it.

You’re touching on a few different points so I’ll try to cover everything.

- We do build on top of OSS (just not those programs you listed - see https://pspdfkit.com/legal/acknowledgements/processor-acknow... for a complete list). The layer we build is quite large though, and it would take many person-years to replicate in its entirety. It’s possible though that you don’t need that at all and a focused program that wraps other ones might do the trick for your use case.

- If you build a product based on our tech, you’re making a conscious decision about risk: while I do think we’re gonna be in business in 10 years (we have solid revenue and last year we got backed by a large investor, Insight), and that we would version APIs and support you (not just during upgrades), the reality is that it is indeed possible that we’re not gonna be around anymore, like every other company on the planet. As a consumer, this is the reality for most of the things we buy nowadays. We do take deprecation seriously, as we sell SDKs, and I’m sure that if the company were shutting down you would have enough time to migrate.

- Depending on what you need to build, using our product may shortcut your development time by a large factor. It may not, if you just need to rotate pages of a PDF document and there’s a reliable OSS package that does that in your language of choice. It really depends on what you need to do.

- Even if you package everything with OSS, waiting 10 years is a sufficiently large amount of time that it may not work and you have to fork and rebuild yourself. It’s a different type of risk, but still a risk. 10 years ago Docker had just been launched. Whether you build something on OSS or commercial, you would wanna test things once a year to see if they still work or keep up with security and bug fixes.
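The yearly "does it still work" check from the last point could be sketched as a golden-hash comparison, whichever kind of toolchain produced the file. This is only an illustration: the function names are hypothetical, and it assumes the toolchain produces byte-identical output for the same input. Many PDF generators embed timestamps or document IDs, so a real check may need to compare extracted text or rendered pages instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_unchanged(rendered: Path, golden_hash: str) -> bool:
    """True if the freshly rendered file matches the hash recorded earlier."""
    return sha256_of(rendered) == golden_hash.lower()
```

Record the hash once when the output is known to be good, then rerun the pipeline on the same input each year and compare.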

Ultimately, there are situations where the approach you described is sound: for example, I do my taxes in plain text accounting, using ledger and emacs. I generate the reporting via a couple of Ruby scripts. I do that exactly because I care about longevity: I do my taxes once a year, I don’t wanna spend time fixing the toolchain every time I have to do them. Yet every year I hit a couple of snags I have to fix, but I consider that acceptable.

It's unclear what you're trying to say. They've been around since 2010 and they have quite a large team. Why would they suddenly disappear? Also, what do you want them to do?
What I wanted to say was: "PDF processing is something where I wouldn't want to rely on an online API over something local. And it's also something where I wouldn't want to rely on a small commercial company over an open source project".

I once worked for a software company where non-tech clients would have custom-made software developed for their exclusive use. Half the projects we did were "We're relying for one of our business functions on this software that we bought from this company that's now out of business. We need you to reimplement it from scratch, because we need a tiny change."

To me it makes sense to buy this functionality instead of building it yourself: the upfront cost of building it yourself will likely be much higher, even if you manage to chain together a bunch of open source tools.
> Then I come back to it 10 years later, because I'm writing the second edition of the book, or I want to change something about my reporting infrastructure.

Those seem like one-off PDF conversion use cases that MS Word or Acrobat can easily handle. Not a high-volume, daily PDF invoice use case.

Would work if you want to publish the pdf anyway.
Vast majority of organisations already store all their working documents and data in the cloud.
"The cloud" is not one thing. Each additional company in "the cloud" that gets to see your data increases the attack surface.
FUD. Most companies are using Box, Dropbox, or some variant of cloud-based document storage today. Extending storage services with document transformations and conversions is a logical evolution.