by vintagedave 1426 days ago
Given the mention of security issues in their custom PostScript extensions, and that PDF files are often malformed, I wonder why they chose C as the language for the new interpreter. I don't want to write a typical HN comment (cough use Rust for everything :)) but surely there is _some_ better language for entirely new development of a secure and fast parser in 2022.

The post has no explanation of this choice. Does anyone know?

9 comments

Beyond a lack of memory safety, C has another issue that makes me dislike it for this kind of application: C has a very minimal set of built-in data structures. Combined with a lack of generics, this means that using, say, a dictionary results in quite a bit of the implementation getting hard-coded into every site that uses the dictionary. This is almost invariably done with lots of pointers (since C has no better-constrained reference type), and the result can be bug-prone and difficult to refactor.

For all of C++’s faults, at least it’s possible to use a map (or unordered_set or whatever) and mostly avoid encoding the fact that it’s anything other than an associative container of some sort at the call sites. This is especially true in C++11 or newer with auto.

[WUFFS](https://github.com/google/wuffs) is made for stuff like this, and it has a library available as transpiled C code.
> this means that using, say, a dictionary means that quite a bit of the implementation gets hard coded into every site that uses the dictionary

I don't understand this part of your comment. There's nothing preventing you from designing a nice well-encapsulated map/dictionary data structure in C and I'm sure there are many many libraries that do just that.

I do agree though that having such basic data structures in the standard library, as modern C++ does, is usually preferable.

Lack of generics will do that, unless you consider that blindly casting `void *` all over the place counts as "well-encapsulated". Even with macro-soup, designing a good agnostic dictionary implementation for C is rather challenging. Linked lists are okay if you use something like the kernel's list.h, but even then it's macro-heavy and has its pitfalls.

In my work as an embedded developer I still use C a lot and it's probably the programming language I know best and have the most experience with but it would never cross my mind to write a PDF interpreter in it unless I had a tremendous reason to do so. There are so many better choices these days.

Type safety and encapsulation are distinct issues. The Linux kernel uses many well-encapsulated interfaces but it's written in C and the typing reflects that limitation.

Personally I haven't used straight C in years and would never choose it over C++ unless platform constraints required it, but a vast amount of very complex software has been and continues to be written in C, including all the widely used OS kernels, so I don't find it very surprising that a new feature in a very old piece of software would be written in it.

Except when you need to build from source; you'll need yet another whole compiler toolchain that may or may not behave well in a specific environment - e.g., do you know how well Rust (or another "modern" language) works on late-nineties MIPS systems? The C compiler is the lowest common denominator.
> There's nothing preventing you from designing a nice well-encapsulated map/dictionary data structure in C

When you write a set function for your map data structure, what type do you make the key parameter?

Code from yalsat (stochastic SAT solver) [1] made me learn something two years ago. I can declare an array of some elements and make access to elements statically typed. Same with maps, sets and others.

[1] https://github.com/msoos/yalsat/blob/main/yals.c#L49
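
For anyone who doesn't want to click through, the general technique looks roughly like this. To be clear, this is not copied from yalsat: `DEFINE_VEC`, `int_vec` and `str_vec` are names I made up for illustration, and it is only a minimal sketch of generating a typed growable array with the preprocessor.

```c
#include <assert.h>
#include <stdlib.h>

/* DEFINE_VEC is a hypothetical macro (not taken from yalsat).
 * It stamps out a struct plus typed push/at functions for element type T,
 * so every access is checked against T by the compiler. */
#define DEFINE_VEC(NAME, T)                                        \
    struct NAME { T *data; size_t len, cap; };                     \
    static inline int NAME##_push(struct NAME *v, T item)          \
    {                                                              \
        if (v->len == v->cap) {                                    \
            /* Grow geometrically; start with room for 8 items. */ \
            size_t cap = v->cap ? 2 * v->cap : 8;                  \
            T *p = realloc(v->data, cap * sizeof *p);              \
            if (p == NULL)                                         \
                return -1; /* old buffer is left untouched */      \
            v->data = p;                                           \
            v->cap = cap;                                          \
        }                                                          \
        v->data[v->len++] = item;                                  \
        return 0;                                                  \
    }                                                              \
    static inline T NAME##_at(const struct NAME *v, size_t i)      \
    {                                                              \
        assert(i < v->len); /* bounds check in debug builds */     \
        return v->data[i];                                         \
    }

/* One instantiation per element type. */
DEFINE_VEC(int_vec, int)
DEFINE_VEC(str_vec, const char *)
```

A zero-initialized `struct int_vec v = {0};` is a valid empty vector, and the caller releases it with `free(v.data)`. Passing a string to `int_vec_push` gets a compiler diagnostic, which a `void *` container would not give you. The price is one block of generated code per element type and error messages that point into a macro expansion.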

This is a pointer-based language, so there are lots of ways to solve that, but you know that already.. this is a setup question.. of course it's not useful to re-invent critical, secure functions over and over, yet what if I am not writing critical, secure functions anyway?

I would choose a key type that is natural to the environment and problem.. unsigned integers are useful. Which unsigned integer size? There are only a couple of practical answers to that.. unless there is some massive dataset, use a 32-bit unsigned integer, like so much software does right now.

size_t key_size, void *key
And then eschew type safety
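
To make the trade-off concrete, here is a minimal sketch of what that signature leads to. All the names (`kv_map`, `kv_set`, `kv_get`, `kv_free`) are hypothetical, and the linked-list storage is only there to keep the example short; a real map would hash the key bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte-keyed map; names and layout are illustrative only. */
struct kv_entry {
    struct kv_entry *next;
    size_t key_size;
    void *key;   /* private copy of the caller's key bytes */
    void *value; /* caller-owned pointer, never copied or freed here */
};

struct kv_map {
    struct kv_entry *head; /* unsorted list, so lookups are O(n) */
};

static struct kv_entry *kv_find(const struct kv_map *m,
                                size_t key_size, const void *key)
{
    for (struct kv_entry *e = m->head; e != NULL; e = e->next)
        if (e->key_size == key_size && memcmp(e->key, key, key_size) == 0)
            return e;
    return NULL;
}

/* Inserts or overwrites. Returns 0 on success, -1 if allocation fails. */
int kv_set(struct kv_map *m, size_t key_size, const void *key, void *value)
{
    struct kv_entry *e = kv_find(m, key_size, key);
    if (e != NULL) {
        e->value = value;
        return 0;
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = malloc(key_size ? key_size : 1);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->key, key, key_size);
    e->key_size = key_size;
    e->value = value;
    e->next = m->head;
    m->head = e;
    return 0;
}

/* Returns the stored pointer, or NULL if the key is absent. */
void *kv_get(const struct kv_map *m, size_t key_size, const void *key)
{
    struct kv_entry *e = kv_find(m, key_size, key);
    return e ? e->value : NULL;
}

/* Frees the entries and key copies, but not the values. */
void kv_free(struct kv_map *m)
{
    struct kv_entry *e = m->head;
    while (e != NULL) {
        struct kv_entry *next = e->next;
        free(e->key);
        free(e);
        e = next;
    }
    m->head = NULL;
}
```

The interface hides the storage completely, so it is encapsulated. But nothing stops a caller from storing an `int *` and assigning the result of `kv_get` to a `double *`, or from passing the wrong `key_size`, and the compiler stays silent in both cases. Comparing keys with `memcmp` also goes wrong for struct keys that contain padding bytes.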
> nice well-encapsulated

...

> void *

Type safety and encapsulation aren't the same thing. Encapsulation is about hiding implementation details from the user of an API, which is what the comment I originally replied to was claiming you couldn't do in C.
Code reuse is achievable by (mis)using the preprocessor system. It is possible to build a somewhat usable API, even for intrusive data structures. (eg. the linux kernel and klib[1])

I do agree that generics are required for modern programming, but for some, the cost of complexity of modern languages (compared to C) and the importance of compatibility seem to outweigh the benefits.

[1]: http://attractivechaos.github.io/klib
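
As a rough illustration of the intrusive style, here is a sketch I wrote for this comment. It is not code from the kernel or klib: the kernel's list.h is a circular doubly linked list with more safety checks, and `CONTAINER_OF`, `list_push`, `pdf_object` and `sum_ids` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* The link lives inside the payload struct ("intrusive"), so the list
 * code never allocates and never needs to know the payload type. */
struct list_node {
    struct list_node *next;
};

/* Recover a pointer to the enclosing struct from a pointer to its member.
 * Simplified, hypothetical version of the kernel's container_of. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical payload type carrying its own list link. */
struct pdf_object {
    int id;
    struct list_node link;
};

/* Push a node onto the front of a singly linked list. */
static inline void list_push(struct list_node **head, struct list_node *n)
{
    assert(n != NULL);
    n->next = *head;
    *head = n;
}

/* Walk the list and add up the ids of the enclosing pdf_object structs. */
static inline int sum_ids(struct list_node *head)
{
    int sum = 0;
    for (struct list_node *n = head; n != NULL; n = n->next)
        sum += CONTAINER_OF(n, struct pdf_object, link)->id;
    return sum;
}
```

The list functions are written once and reused for any struct that embeds a `struct list_node`. The weak spot is `CONTAINER_OF`: if you name the wrong type or member, it still compiles and just yields a bad pointer.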

My guess is that since the rest of the project (not in PS itself) is in C, it’s in C. And it may be borrowing from the PS interpreter codebase. I dunno.

Requiring another skillset, toolchain, etc. is onerous and has to be weighed in those decisions. Rust is cool for sure, but difficult to adopt in brownfield projects because of humans more than tech.

Also, it wasn’t written in 2022, just made the default now. GS is a venerable codebase, and jumping on a “new” language bandwagon may have seemed dangerous at the time it was started.

All conjecture. I’m not an expert or involved.

We (Latacora) previously advised clients to encapsulate GhostScript processing in something with a hard security boundary (like a Lambda) and I am not expecting the new implementation to change that.
Is this AWS Lambda or what kind of "Lambda" is this about?
Yep, AWS Lambda.
It looks like it needs to be interoperable with the rest of their codebase, which was already written in C.

> The new PDF interpreter is written entirely in C, but interfaces to the same underlying graphics library as the existing PostScript interpreter. So operations in PDF should render exactly the same as they always have (this is affected slightly by differing numerical accuracy), all the same devices that are currently supported by the Ghostscript family, and any new ones in the future should work seamlessly.

That is not an argument, at least for Rust, since it's super easy to consume and offer a C interface. I think it's more of a shift in mentality that needs to occur.
While it doesn't prevent Rust from being used, it is still a hurdle which must be overcome. Building and maintaining a multi-language build system has significant costs, especially for a project with as much history and wide use as Ghostscript.
It is so easy and well documented that the first page of Google results for “rust autotools” does not contain anything about how to integrate Rust code into an existing autotools project.

Another issue is the general subtle brokenness of Rust tooling on anything that is not Linux on amd64.

I suspect they need portability more than most projects.
Are you kidding? Many other languages are as portable, if not more portable.[α] Your point would be valid in 1972, not in 2022. I can't believe you're regurgitating the same "portability" from 50 years ago, today (unless you meant it as a joke and forgot to include a /s).

[α] Languages targeting LLVM or supported by GCC are portable to every target machine code / ISA / architecture supported by those toolchains. JVM, JS, etc are portable to all the platforms they support. You don't need to do any extra work (of recompiling) if you use a bytecode VM / platform (for example, like JVM).

Well, there's portability and then there's portability. Getting LLVM to emit artifacts on a given target is easy. Getting assurance that big, complex interfaces that integrate with the underlying OS in extremely specific ways (i.e. your programming language's IO or concurrency system) behave correctly on that target, and have appropriate testing, community support, and documentation is another thing entirely.

Like, I get it. The claim that "rust isn't portable" is often used as a thought terminating cliche, and is often wrong or irrelevant in context. But the claim "X uses LLVM, LLVM can target environment Y, therefore X is fully compatible with Y" is just as reductive and misleading.

Does an LLVM requirement fit the social and license goals of this fundamental ecosystem project?
I don't even actively code with Rust, but just the fact that it's been packaged as a dependency has been enough of a headache for me. The latest issue is with some Homebrew package that has Rust as a dependency. It turns out that on macOS Mojave Rust needs to be built from source since there is no bottle. I let it build for a full day and it still didn't finish, so I gave up. Then I installed Rust independently with rustup and successfully linked that install to brew, which nearly worked, but failed with the cryptic "rustup could not choose a version of cargo to run..." error that I can't make any sense of. The solution it gave for that error, downloading the latest stable release and setting it as your toolchain with 'rustup default stable', didn't do anything because that was already done. The real salt in the wound is that a modern Google search brings up nothing relevant.
WUFFS seems like a great option for this.
One reason may be that they want to build a high-level wrapper of that C API, something that is well documented in some languages (e.g. Python).
No, not more Rust activism. Please, anything but more of this. Have some shame.