| HN Mirror



	by inferiorhuman 2316 days ago
	The Rust version probably could be made to work at an equivalent speed with enough effort. But at a high level, Go was much more enjoyable to work with. This is a side project and it has to be fun for me to work on it. The Rust version was actively un-fun for me, both because of all of the workarounds that got in the way and because of the extremely slow compile times. Obviously you can tell from the nature of this project that I value fast build times :)

	Was the Rust parser written by hand, or did you use one of the parser frameworks out there (e.g. nom or pest)? nom, for instance, goes to great lengths to be zero-copy, which would probably be a big benefit here.


constexpr 2316 days ago

Both the Rust and Go parsers were written by hand. They are also very similar (basically the Go version was a direct port of the Rust version) so the performance should be very comparable.

I assume by zero-copy you mean that identifiers in the AST are slices of the input file instead of copies? I was also careful to do this in both the Go and Rust versions. It's somewhat complicated because some JavaScript identifiers can technically have escape sequences (e.g. "\u0061bc" is the identifier "abc"), which require dynamic memory allocation anyway. See "allocatedNames" in the current parser for how this is handled.

Note that strings aren't slices of the input file because JavaScript strings are UTF-16, not UTF-8, and can have unpaired surrogates. So I represent string contents as arrays of 16-bit integers instead of 8-bit slices (in both Go and Rust).
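To make the unpaired-surrogate point concrete, here is a minimal Rust sketch. It is not taken from either parser; the helper names `to_utf16` and `has_unpaired_surrogate` are made up for illustration. It shows that a lone surrogate is representable as 16-bit code units but cannot be converted into a Rust `String`:

```rust
/// Encode a Rust string as UTF-16 code units (hypothetical helper).
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// True if the code units contain a surrogate that is not part of a valid pair.
fn has_unpaired_surrogate(units: &[u16]) -> bool {
    char::decode_utf16(units.iter().copied()).any(|r| r.is_err())
}

fn main() {
    let plain = to_utf16("abc");
    // The JavaScript literal "a\uD83D": 'a' followed by a lone high surrogate.
    let lone: Vec<u16> = vec![0x0061, 0xD83D];

    println!("{}", has_unpaired_surrogate(&plain));
    println!("{}", has_unpaired_surrogate(&lone));

    // A Rust String must be valid UTF-8, so the strict conversion fails here.
    println!("{}", String::from_utf16(&lone).is_err());
}
```

Keeping the raw `Vec<u16>` preserves the original contents exactly, which is the property described above.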

In the past I tried using WTF-8 encoding (https://simonsapin.github.io/wtf-8/) for string contents, since that can both represent slices of the input file while also handling unpaired surrogates, but I ended up removing it because it complicated certain optimizations. I think the main issue was having to reason through weird edge cases such as constant folding of string addition when two unpaired surrogates are joined together. I think it's still possible to do this but I'm not sure how much of a win it is.
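A rough sketch of the edge case, assuming a folding step that concatenates two string constants. The function names are hypothetical and this is not code from the actual project. With 16-bit code units, concatenation is a plain append. With WTF-8, a trailing lead surrogate followed by a leading trail surrogate has to be rewritten into a single 4-byte sequence, which is the kind of special case that has to be reasoned through:

```rust
/// Fold `a + b` when strings are stored as UTF-16 code units:
/// plain concatenation is already correct.
fn fold_utf16(a: &[u16], b: &[u16]) -> Vec<u16> {
    [a, b].concat()
}

/// Decode a 3-byte WTF-8 surrogate sequence (ED A0..BF xx) to its code unit.
fn surrogate_at(bytes: &[u8]) -> Option<u16> {
    match bytes {
        [0xED, b2 @ 0xA0..=0xBF, b3 @ 0x80..=0xBF] => {
            Some(0xD000 | (((*b2 & 0x3F) as u16) << 6) | ((*b3 & 0x3F) as u16))
        }
        _ => None,
    }
}

/// Fold `a + b` when strings are stored as WTF-8 bytes. If `a` ends with a
/// lead surrogate and `b` starts with a trail surrogate, the two 3-byte
/// sequences are replaced by the 4-byte encoding of the combined code point.
fn fold_wtf8(a: &[u8], b: &[u8]) -> Vec<u8> {
    if a.len() >= 3 && b.len() >= 3 {
        let lead = surrogate_at(&a[a.len() - 3..]);
        let trail = surrogate_at(&b[..3]);
        if let (Some(hi @ 0xD800..=0xDBFF), Some(lo @ 0xDC00..=0xDFFF)) = (lead, trail) {
            let cp = 0x10000 + (((hi - 0xD800) as u32) << 10) + (lo - 0xDC00) as u32;
            let mut out = a[..a.len() - 3].to_vec();
            out.extend_from_slice(&[
                0xF0 | (cp >> 18) as u8,
                0x80 | ((cp >> 12) & 0x3F) as u8,
                0x80 | ((cp >> 6) & 0x3F) as u8,
                0x80 | (cp & 0x3F) as u8,
            ]);
            out.extend_from_slice(&b[3..]);
            return out;
        }
    }
    [a, b].concat()
}

fn main() {
    // "\uD83D" + "\uDE00" in JavaScript yields U+1F600.
    let utf16 = fold_utf16(&[0xD83D], &[0xDE00]);
    println!("{:?}", String::from_utf16(&utf16));

    let wtf8 = fold_wtf8(&[0xED, 0xA0, 0xBD], &[0xED, 0xB8, 0x80]);
    println!("{:02X?}", wtf8);
}
```

A naive byte append in the WTF-8 case would leave two surrogate sequences side by side, which is not the canonical encoding of the resulting string.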

inferiorhuman 2316 days ago

They are also very similar (basically the Go version was a direct port of the Rust version) so the performance should be very comparable.

Sure, but different approaches are going to be more optimal for different languages.

I assume by zero-copy you mean that identifiers in the AST are slices of the input file instead of copies?

Yes. From the README:

zero-copy: if a parser returns a subset of its input data, it will return a slice of that input, without copying

Geal also makes claims that nom is faster than hand-written C parsers.

It's somewhat complicated because some JavaScript identifiers can technically have escape sequences (e.g. "\u0061bc" is the identifier "abc"), which require dynamic memory allocation anyway.

Nom comes with 'escaped' and 'escaped_transform' combinators. In theory it should be possible, with relative ease, to return a slice if there are no escape characters and an allocated string if expansion is required. Presumably you'd have to use a Cow<str> though.
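Something along these lines is what I have in mind. This is a hand-written sketch using only the standard library, not nom's API, and `decode_identifier` is a made-up name. It only handles the four-digit `\uXXXX` form and rejects anything else:

```rust
use std::borrow::Cow;

/// Decode `\uXXXX` escapes in an identifier (hypothetical helper).
/// Returns a borrowed slice of the input when there are no escapes,
/// and allocates a new String only when expansion is required.
fn decode_identifier(raw: &str) -> Option<Cow<'_, str>> {
    if !raw.contains('\\') {
        return Some(Cow::Borrowed(raw));
    }
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(pos) = rest.find('\\') {
        out.push_str(&rest[..pos]);
        // Expect exactly `\u` followed by four hex digits.
        if !rest[pos + 1..].starts_with('u') {
            return None;
        }
        let hex = rest.get(pos + 2..pos + 6)?;
        if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        let code = u32::from_str_radix(hex, 16).ok()?;
        // Surrogate code points are rejected here, since `char` cannot hold them.
        out.push(char::from_u32(code)?);
        rest = &rest[pos + 6..];
    }
    out.push_str(rest);
    Some(Cow::Owned(out))
}

fn main() {
    for raw in ["abc", "\\u0061bc"] {
        match decode_identifier(raw) {
            Some(Cow::Borrowed(s)) => println!("{} -> {} (borrowed)", raw, s),
            Some(Cow::Owned(s)) => println!("{} -> {} (allocated)", raw, s),
            None => println!("{} -> invalid escape", raw),
        }
    }
}
```

The common case (no backslash) never allocates, so the AST can still hold slices of the input for almost every identifier.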

Note that strings aren't slices of the input file because JavaScript strings are UTF-16, not UTF-8, and can have unpaired surrogates. So I represent string contents as arrays of 16-bit integers instead of 8-bit slices (in both Go and Rust).

Of course it is. My opinion (which is worth what you've paid for it) is that I'd just go for UTF-8 support. I can't remember the last time I've seen UTF-16 in the wild (thankfully).

Performance-wise, the other thing I'd keep in mind with Rust is that string handling is painfully slow in debug mode.

Edit: here's the URL for nom: https://github.com/Geal/nom