Hacker News new | ask | show | jobs
by trishume 3422 days ago
Shameless plug: my implementation of Sublime's syntax highlighting engine in Rust has similar optimizations and more. I'm not at my computer to benchmark on the same files but it should be >2x as fast as their "after" numbers just based on lines/second for JS-like files.

This evening I'm even trying to port it to a pure Rust regex engine that should eliminate non-Rust code and make it substantially faster.

It also implements the sublime-syntax format which is a superset of tmlanguage that allows even nicer highlighting.

https://github.com/trishume/syntect

4 comments

So I ran the sqlite3.c benchmark on my machine (a comparable "somewhat powerful" machine) and it took 6.7s with my engine vs 10.9s with the new VSCode one. Both are doing tokenization+theme matching but the machines and exact grammars are not necessarily the same. I'm using ST3's C grammar but they are using a different one. It could be that I ran it on a faster computer with an easier grammar, or it could be the opposite. It's close enough I'm not willing to claim my engine is substantially faster.

For context Sublime Text 3 takes 2 seconds with the same grammar and same file on the same computer due to using a better custom regex engine written specifically for highlighting.

Given what alexdima mentioned in a different comment about spending most of the time in the regex engine, I'm not sure that my engine would be substantially faster under exactly identical conditions since I'm also bottlenecked by Oniguruma.

However, maybe after I port my engine to https://github.com/google/fancy-regex I'll be substantially faster. And if I do it is likely they could also benefit from a fancy-regex port.

Thanks for this. Clearly the original post describes good work, but I can't help feeling the JS community is slacking off when it comes to performance.

Just eyeballing the cited numbers, they take 3939ms to handle a 1.18MB input on "a somewhat powerful desktop machine". Assuming that that means a chip running at 2GHz, we're talking about over 6300 cycles per byte!

That's quite frankly ridiculous. An improvement by at least one order of magnitude should be possible. Where's the ambition?

(Yes, there's always a trade-off with these things. But I feel someone has to point this out when the OP is explicitly about getting kudos for performance work.)

I hate slowness and inefficiency too, that's why I try to make the editor as fast as possible :), but at least in this case, it is not the dynamic nature of JS to blame, but rather the nature of TM grammars. TM grammars consist of rules that have regular expressions, which need to be constantly evaluated; and in order to implement a correct TM grammar interpreter, you must evaluate them.

I've looked in the past for optimization opportunities in the C land (mostly through better caching), which yielded quite nice results [1][2]. I would love if you'd want to take a look too.

At this point, in tokenization, 90% of the time is spent in C, matching regular expressions in oniguruma. More precisely, regular expressions are executed 3,933,859 times to tokenize checker.ts -- the 1.18MB file. That is with some very good caching in node-oniguruma and it just speaks to the inefficiency of the TM grammars regex based design, more than anything else.

It is definitely possible to write faster tokenizers, especially when writing them by hand (even in JS), see for example the Monaco Editor[3] where we use the TypeScript compiler's lexer as a tokenizer.

At least in this case, inefficiencies are not caused by our runtime.

[1] https://github.com/atom/node-oniguruma/pull/40

[2] https://github.com/atom/node-oniguruma/pull/46

[3] https://microsoft.github.io/monaco-editor/

Do you pre-process the regular expressions into a common DFA, or does oniguruma do that for you? That would seem like the natural design for this.

It's non-trivial because TextMate grammar seem like they're just a little bit too general to be convenient. So there's definitely a trade-off. But if I wanted to really get as fast as possible, I would try to see if I can get there.

It's harder than it sounds if you want to support many languages. The Sublime syntaxes repo I use has 34,000 lines of grammars whereas my engine is only 3000 lines of code. If you count all the tmLanguage files for nice languages available online it's probably hundreds of thousands of lines, and that's in a pretty dense format. The whole point of using tmLanguage files is that people don't care about how fast other languages are if there is no highlighting for their language.

I could get way better performance by rewriting all those grammars using compiled parsers in Rust (like Xi has as an option https://github.com/google/xi-editor/blob/master/rust/lang/Ca...) but it would take an absurd amount of effort.

The speed of highlighting tmLanguage files is limited mostly by the regex library. My code spends 50% of its time in Oniguruma. You can improve that a bit by using a fancy regex engine, which is how Sublime is faster than my engine, but this evening I'm going to try to port my library to a faster engine based on Rust's regex crate. But that library (https://github.com/google/fancy-regex) is brand new and almost untested. It's the only open source library that is faster than Oniguruma while supporting all the right features. In the future VSCode may be able to gain some speed by switching to it, but they have much more overhead over the regex library to eat away first than I do.

Reminds me of an "efficient programming" exercise back at university.

We had to implement getuid [0] as an executable in C, which just looks up /etc/passwd by name and returns a number.

By using specialised trie-like data structures and, in the end, also custom memory alignment instead of malloc, we squeezed the whole algorithm runtime down to an average of 45 cycles or so.

[0] http://man7.org/linux/man-pages/man2/getuid.2.html

The dynamic languages people are always saying programmer time is more important than computer time. Yet they put out an editor that is wasteful with both.
Sorry but I'm not familiar with Rust -- is it possible to run your lib in all platforms in a Node environment?

And, do you have an issue for the regex port? Would love to see the benchmark.

Yes it would be possible to write a C wrapper for my Rust library and link that from Node, however I expect you might lose a lot of performance to data conversion.

I have yet to do the regex engine port, I'm doing that later this evening. I'm porting it from Oniguruma to https://github.com/google/fancy-regex which accelerates common types of regexes using the awesome Rust regex crate which is super fast.

Doesn't rust export the C abi? Or does the "c wrapper" just involve converting c types to rust domain types>
You can very easily export a C api from rust. And fact you would have to if you wanted to write a wrapper anyways so I think the wrapper is probably unnecessary.
Sorry that's what I meant. Just export c ABI functions in a Rust crate.
That moment when someone else already did your hobby project :congrats through tears: