by nhaehnle
3418 days ago
Thanks for this. Clearly the original post describes good work, but I can't help feeling the JS community is slacking off when it comes to performance. Just eyeballing the cited numbers, they take 3939ms to handle a 1.18MB input on "a somewhat powerful desktop machine". Assuming this means a chip running at 2GHz, we're talking about over 6300 cycles per byte! That's quite frankly ridiculous. An improvement of at least one order of magnitude should be possible. Where's the ambition? (Yes, there's always a trade-off with these things. But I feel someone has to point this out when the OP is explicitly about getting kudos for performance work.)
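For anyone who wants to check the estimate, here is a rough sketch of the arithmetic. The 2GHz clock is the assumption stated above; whether "1.18MB" means MiB or 10^6-byte megabytes is not given, so both readings are computed. The function and variable names are made up for illustration.

```typescript
// Back-of-the-envelope cycles-per-byte estimate.
// Only the 3939ms and 1.18MB figures come from the thread; the rest is assumed.
function cyclesPerByte(elapsedMs: number, sizeBytes: number, clockHz: number): number {
  const totalCycles = (elapsedMs / 1000) * clockHz;
  return totalCycles / sizeBytes;
}

const elapsedMs = 3939;                      // cited tokenization time
const clockHz = 2e9;                         // assumed 2GHz clock
const sizeIfMiB = 1.18 * 1024 * 1024;        // assumption: MB read as MiB
const sizeIfDecimalMB = 1.18 * 1_000_000;    // assumption: MB read as 10^6 bytes

const estimateMiB = cyclesPerByte(elapsedMs, sizeIfMiB, clockHz);
const estimateDecimalMB = cyclesPerByte(elapsedMs, sizeIfDecimalMB, clockHz);

console.log(`MiB reading: ~${Math.round(estimateMiB)} cycles/byte`);
console.log(`10^6-byte reading: ~${Math.round(estimateDecimalMB)} cycles/byte`);
```

Either reading lands above 6300 cycles per byte, so the conclusion does not depend on which definition of MB was meant.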
|
In the past I've looked for optimization opportunities on the C side (mostly through better caching), which yielded quite nice results [1][2]. I would love it if you took a look too.
At this point, in tokenization, 90% of the time is spent in C, matching regular expressions in oniguruma. More precisely, regular expressions are executed 3,933,859 times to tokenize checker.ts -- the 1.18MB file. That is with some very good caching in node-oniguruma, and it speaks to the inefficiency of the TM grammars' regex-based design more than anything else.
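To put those counts in per-byte terms, a rough sketch follows. It assumes 1.18MB means MiB, and it assumes the 90% share applies to the 3939ms figure cited upthread; neither is confirmed here, so treat the per-execution time as an order-of-magnitude guess. The names are made up for illustration.

```typescript
// Rough per-byte and per-execution figures derived from the numbers in this thread.
const regexExecutions = 3_933_859;           // cited count for checker.ts
const fileSizeBytes = 1.18 * 1024 * 1024;    // assumption: 1.18MB read as MiB
const totalMs = 3939;                        // assumption: the time cited upthread applies
const shareSpentInRegex = 0.9;               // cited share of time spent matching in C

const executionsPerByte = regexExecutions / fileSizeBytes;
const microsPerExecution = (totalMs * shareSpentInRegex * 1000) / regexExecutions;

console.log(`~${executionsPerByte.toFixed(2)} regex executions per byte`);
console.log(`~${microsPerExecution.toFixed(2)} microseconds per execution`);
```

Under these assumptions that is roughly three regex executions per input byte at a little under a microsecond each, which is consistent with the cost coming from the number of executions rather than from any single slow match.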
It is definitely possible to write faster tokenizers, especially when writing them by hand (even in JS); see, for example, the Monaco Editor [3], where we use the TypeScript compiler's lexer as a tokenizer.
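To illustrate what "by hand" means, here is a minimal sketch of a single-pass tokenizer that classifies characters by char code instead of running regular expressions. It is not the TypeScript compiler's lexer or Monaco's code; the token kinds and all function names are hypothetical, and it only handles ASCII identifiers, integers and whitespace.

```typescript
// Hypothetical hand-written tokenizer: one forward pass, no regex matching.
type TokenKind = "identifier" | "number" | "whitespace" | "other";

interface Token {
  kind: TokenKind;
  start: number; // inclusive offset
  end: number;   // exclusive offset
}

function isDigit(c: number): boolean {
  return c >= 48 && c <= 57; // '0'..'9'
}

function isIdentStart(c: number): boolean {
  // 'A'..'Z', 'a'..'z', '_' or '$'
  return (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95 || c === 36;
}

function isSpace(c: number): boolean {
  return c === 32 || c === 9 || c === 10 || c === 13; // space, tab, LF, CR
}

function tokenize(text: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < text.length) {
    const start = pos;
    const c = text.charCodeAt(pos);
    let kind: TokenKind;
    pos++;
    if (isIdentStart(c)) {
      kind = "identifier";
      while (pos < text.length &&
             (isIdentStart(text.charCodeAt(pos)) || isDigit(text.charCodeAt(pos)))) pos++;
    } else if (isDigit(c)) {
      kind = "number";
      while (pos < text.length && isDigit(text.charCodeAt(pos))) pos++;
    } else if (isSpace(c)) {
      kind = "whitespace";
      while (pos < text.length && isSpace(text.charCodeAt(pos))) pos++;
    } else {
      kind = "other"; // any other single character
    }
    tokens.push({ kind, start, end: pos });
  }
  return tokens;
}

const sampleTokens = tokenize("let x1 = 42;");
console.log(sampleTokens.map(t => t.kind).join(" "));
```

Each character is inspected a small, fixed number of times, whereas a grammar-driven tokenizer may try several regular expressions at every position.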
At least in this case, inefficiencies are not caused by our runtime.
[1] https://github.com/atom/node-oniguruma/pull/40
[2] https://github.com/atom/node-oniguruma/pull/46
[3] https://microsoft.github.io/monaco-editor/