
by alexdima 3418 days ago

I hate slowness and inefficiency too, that's why I try to make the editor as fast as possible :). In this case, though, the blame lies not with the dynamic nature of JS but with the nature of TM (TextMate) grammars. TM grammars consist of rules built on regular expressions, which need to be evaluated constantly; a correct TM grammar interpreter has no way around evaluating them.
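To make that cost concrete, here is a minimal Python sketch of the inner loop of a regex-driven tokenizer. It is a toy under stated assumptions, not the actual VS Code tokenizer: the `RULES` table, the scope names, and `tokenize_line` are all hypothetical, and real TM grammars add begin/end rules and nested scopes that are left out here. The only point is that every rule's regex is tried at each step in order to find the earliest match.

```python
import re

# Hypothetical, heavily simplified rule set: (scope name, compiled pattern).
RULES = [
    ("comment", re.compile(r"//.*")),
    ("string", re.compile(r'"[^"]*"')),
    ("number", re.compile(r"\d+")),
    ("keyword", re.compile(r"\b(?:let|const|return)\b")),
]


def tokenize_line(line):
    """Return (tokens, regex_calls) for a single line of text."""
    tokens = []
    regex_calls = 0
    pos = 0
    while pos < len(line):
        best = None
        # Every rule is searched from the current position; the earliest
        # match wins, and ties go to the rule listed first.
        for scope, pattern in RULES:
            regex_calls += 1
            m = pattern.search(line, pos)
            if m and m.end() > m.start():
                if best is None or m.start() < best[1].start():
                    best = (scope, m)
        if best is None:
            break  # no rule matches the rest of the line
        scope, m = best
        tokens.append((scope, m.group()))
        pos = m.end()
    return tokens, regex_calls


tokens, calls = tokenize_line("let x = 42; // answer")
print(tokens, calls)
```

In this toy, the sample line takes 12 regex searches (4 rules, 3 steps) to produce 3 tokens, which is the kind of multiplication that adds up over a large file.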

I've looked in the past for optimization opportunities in C land (mostly through better caching), which yielded quite nice results [1][2]. I would love it if you took a look too.

At this point, in tokenization, 90% of the time is spent in C, matching regular expressions in oniguruma. More precisely, regular expressions are executed 3,933,859 times to tokenize checker.ts -- the 1.18MB file. That is with some very good caching in node-oniguruma, and it speaks to the inefficiency of the regex-based design of TM grammars more than anything else.

It is definitely possible to write faster tokenizers, especially when writing them by hand (even in JS), see for example the Monaco Editor[3] where we use the TypeScript compiler's lexer as a tokenizer.
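As a rough illustration of what "by hand" means, here is a small single-pass lexer sketch in Python. It is not the TypeScript compiler's lexer or anything Monaco ships: `KEYWORDS` and `lex` are hypothetical names, and the token set is deliberately tiny. It shows the general shape of the technique, which is to look at each character once and branch on it, with no regular expressions involved.

```python
# Hypothetical keyword set for a toy language.
KEYWORDS = {"let", "const", "return"}


def lex(line):
    """Scan one line left to right and return a list of (kind, text) tokens."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        start = i
        if line.startswith("//", i):
            # Line comment: consumes the rest of the line.
            tokens.append(("comment", line[i:]))
            break
        if ch == '"':
            # String literal: scan to the closing quote (or end of line).
            i += 1
            while i < n and line[i] != '"':
                i += 1
            i = min(i + 1, n)
            tokens.append(("string", line[start:i]))
        elif ch.isdigit():
            while i < n and line[i].isdigit():
                i += 1
            tokens.append(("number", line[start:i]))
        elif ch.isalpha() or ch == "_":
            while i < n and (line[i].isalnum() or line[i] == "_"):
                i += 1
            word = line[start:i]
            if word in KEYWORDS:
                tokens.append(("keyword", word))
        else:
            i += 1  # whitespace and punctuation are skipped in this toy
    return tokens


print(lex("let x = 42; // answer"))
```

Each character is visited a constant number of times, so the work grows with the length of the input rather than with the number of rules.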

At least in this case, inefficiencies are not caused by our runtime.

[1] https://github.com/atom/node-oniguruma/pull/40

[2] https://github.com/atom/node-oniguruma/pull/46

[3] https://microsoft.github.io/monaco-editor/

1 comment

nhaehnle 3418 days ago

Do you pre-process the regular expressions into a common DFA, or does oniguruma do that for you? That would seem like the natural design for this.

It's non-trivial because TextMate grammars seem like they're just a little bit too general to be convenient. So there's definitely a trade-off. But if I wanted to really get as fast as possible, I would try to see if I can get there.
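A loose Python sketch of the "combine the rules" idea follows. One caveat up front: Python's `re` is a backtracking engine, so this does not build a DFA. It only illustrates the structural change of issuing one match call per token instead of one per rule. The `RULES` table, `COMBINED`, and `tokenize_line_combined` are hypothetical names, and the approach assumes simple, context-free patterns that cannot match the empty string, which is exactly where the more general TextMate features would get in the way.

```python
import re

# Hypothetical rule set: (scope name, pattern source).
RULES = [
    ("comment", r"//.*"),
    ("string", r'"[^"]*"'),
    ("number", r"\d+"),
    ("keyword", r"\b(?:let|const|return)\b"),
]

# One alternation with a named group per rule; earlier rules win ties.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))


def tokenize_line_combined(line):
    """Return (tokens, regex_calls) using a single combined pattern."""
    tokens = []
    regex_calls = 0
    pos = 0
    while True:
        regex_calls += 1
        m = COMBINED.search(line, pos)
        if m is None:
            break
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, regex_calls


print(tokenize_line_combined("let x = 42; // answer"))
```

On the sample line this makes 4 search calls (3 matches plus the final miss) regardless of how many rules are in the table, though the engine still does per-alternative work internally.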
