Hacker News new | ask | show | jobs
by nu11ptr 1044 days ago
The article has fair points, but after trying OCaml and Rust... I chose Rust. Without going into huge amounts of detail, a compiler is more than simply a parser/ast/code generator and there are other aspects to consider such as the richness of the ecosystem, editor support, etc. Also, I suspect the author is more familiar with OCaml than Rust as you wouldn't typically box everything but likely use an arena for the AST. In the same way, I am more familiar with Rust than OCaml, so some of the warts I observed may be to lack of familiarity. As such I suspect the authors perspective is biased...as is mine. Nothing wrong with that.
2 comments

Can you write an equivalent piece of code that shows why Rust wins here with more familiarity and leveraging the ecosystem? With compilers I don't think there is a huge amount of using the ecosystem. I think what TFA does is a good case study in trying to be objective: write it both ways and compare.
I'm just saying a compiler is a program, and while the more important things are heavy algorithm related, supporting libraries like those for error handling, etc. all still matter and add up. No problem if you disagree - just my perspective. This isn't so much an objective thing as it is a personal opinion.
Personally, it’s not so much that I disagree. It’s more that Ocaml has a top notch ecosystem for compiler writing. That’s probably by far its strongest point with library like Menhir having few equivalent in different ecosystem.

Not that there is anything wrong with Rust if you are ready to pay the price of having so much low level things to deal with.

Can you explain what Menhir does better than other parser frameworks? For what it's worth, I'm not a huge fan of parser frameworks. I tend to prefer hand written parsers, either via a combinator library or fully manually.

Rust has a pretty darn good ecosystem too btw. chumsky for parsing, rowan for syntax trees, salsa for incremental computation, miette for errors, inkwell/walrus for code generation.

Wow those library recommendations are fantastic, especially Chumsky. Thanks for sharing.
Yeah chumsky is fantastic, albeit with some rough compile errors and some unsafe. Still my favorite parser library by far and I’ve used a bunch by now.
It can generate elegant and efficient parsers for LR(1) grammars.

> I tend to prefer hand written parsers, either via a combinator library or fully manually.

That’s common with people used to languages which provide poor parser generators.

What makes them more elegant than the average parser? How’s the error recovery? Can you parse into high fidelity syntax trees efficiently?

I don’t know of many production compilers that use parser generators

I don’t think so - I’ve written parsers using flex/lex yacc/bison, antlr, a bunch of functional combinator libraries and maybe others I’ve forgotten but now would never consider anything except hand-written recursive descent with an embedded Pratt parser for expressions and precedence.

Simple to write, debug, recover from errors, provide decent error messages, unit test, integrate into build systems, IDEs etc.

I also believe that nearly all the popular compilers these days do something similar - gcc was rewritten a few years ago in this fashion because of the technical benefits I’ve listed above.

I mean, it’s also common with people who value great feedback in their tools.
> That’s common with people used to languages which provide poor parser generators.

The vast majority of real languages have hand-rolled parsers and there is no real reason they shouldn't.

>That’s common with people used to languages which provide poor parser generators.

lmao

There was nothing objective about the article. The moment I saw seemingly random new lines in the rust for no reason, I knew there was going to be a “line counts!!!!” Sentence. When there was, I stopped reading because it was apparent that all matter of objectivity is completely missing.

I don’t even like rust.

It would be nice if somebody could offer what they consider to be an idiomatic Rust solution to this very routine problem in compilation if they believe the author is being intentionally deceiving. The Rust and OCaml code from the article looked decent to me. m
I don't think anyone was claiming the author was being intentionally deceiving, just that they're not very good at writing idiomatic Rust.

The first thing I'd suggest, as someone who's done their fair share of Rust compiler dev, is to stop using `Box` and `Ref` everywhere. Terms should not own their child terms; the idiomatic way to handle this is to use an arena and typed IDs, and pass a reference to the arena when you need access to the underlying data from the ID.

IIRC The Rust compiler just uses Cell<&’a> for child terms, which is a somewhat lighter weight solution that still allows mutation.

Edit: Either I remembered wrong or it's changed. Now it uses some kind of custom wrapper around Box: https://github.com/rust-lang/rust/blob/master/compiler/rustc... So actually, the Rust compiler itself uses owned children in its AST!

So the solution to writing idiomatic code in the language whose essential feature is the ownership semantics and borrow checker is to not use the ownership semantics and borrow checker?
This is sort of a flippant response, but I'll answer it seriously.

The premise of this question is incorrect: you can't not use the borrow checker in Rust. Instead, you are satisfying the borrow checker by ensuring at the point of use (deref of the ID) that there is valid data by providing a provably-live reference to the arena in which it was allocated.

In another language like C/C++, you'd just use a raw pointer, which is ergonomically-nice, but there's no guarantee that it actually points to a valid object. Rust forces you toward a solution that is guaranteed to be safe, and logically makes sense when you think about it -- nodes in a graph should not own other nodes in the graph; instead, they should be owned by the graph itself, and other nodes should simply hold references to each other. This is how you tend to implement safe and efficient graphs in C++ anyways, the only difference being that in Rust your references are indices, whereas in C++ your references are raw pointers.

> The Rust and OCaml code from the article looked decent to me

> RefCell<u32>

I'm not sure.

For the uninitiated, this is a completely harmless choice. But it suggests a potential lack of understanding of some pretty fundamental parts of Rust.
Although I think the implementation of the gensym function is broken, the article does explain that it wasn’t possible to use &mut u32 because multiple references to the value were required. It would be more idiomatic to use Cell rather than RefCell, but there’s nothing really wrong with using RefCell as far as I can see.
I like Rust but this isn't one of it's strong suits. The code in the article is what Rustfmt outputs (playground link below), so it's at least syntactically idiomatic.

https://play.rust-lang.org/?version=stable&mode=debug&editio...

I've been seriously at it with Rust for about six months now and really loving it. What do you mean by "use an arena for the AST?" What is an arena in this context?
I believe it means flattening the AST, here is a nice blog post about this technique https://www.cs.cornell.edu/~asampson/blog/flattening.html
Nice! Thank you! I'd forgotten about that technique - think I read about it in a ancient compiler book (by french authors no less!)

EDIT: this was a really interesting read!

If you're curious about arena allocators, look at the rust compiler itself. It uses one for all of its allocations as regular heap allocation was not performant enough.
i assume they're talking about an Arena Allocator

https://blog.logrocket.com/guide-using-arenas-rust/

I figured as much but was unsure. So the gain is that you still take heap allocations, but you do it in a "novelty" heap allocator that you will probably dipose of entirely at some point?
Allocation happens roughly like this:

If current top of heap + allocation size > buffer size, allocate an extra buffer for the arena. Save the current top in a temp variable. Add the allocation size to the top Return the temp variable.

It's fast, and the amortized memory overhead per allocation is near zero because you don't need to track allocation sizes or free lists, since you only free the whole arena at once (one or more buffers).

It's ideal for anything where the lifetime of allocations is known to be roughly the same. E.g. a data structure you'll free all at once.