| Actually, when I first saw this, I set out to investigate. I managed to reproduce the same performance in an isolated environment and I has been sitting on a post on how this works for quite a time. Hopefully, I will have time to finish it this weekend. Yes, it is that fast. Long story short, the trick is a completely novel way of implementing interpreters. It is tree-based and I should stress that the tree is not an AST tree but a highly specialized tree-based representation produced from the AST tree. A node is basically equivalent to an "instruction" in a bytecode-based interpreter. Usual arguments against tree-based interpreters do not apply: there is no pointer-chasing, since the nodes sit tightly in the CPU cache. Dispatch overhead is just a virtual function call so should be comparable to a computed goto-based interpreter. The nodes operate on machine-native primitive types and the values are passed around between nodes in registers most of the time - oftentimes the interpeter doesn't even need to access the main memory. There is no encoding or decoding of values and no typechecking during evaluation. The interpreter even goes at lengths to avoid conversion of values in most paths, i.e. if your code operates on uint32_ts, they will be passed around as such up until a node that requires them to be cast, effectively creating typed execution paths. Before the interpreter there is also a "compilation" phase which performs a ton of optimizations, of which I guess the most interesting is the node fusion. As you may guess, it replaces smaller nodes with larger and more specialized ones. In my tests I implemented it for a naive AST-walker-like intepreter for a 3x performance improvement. I can go on but I better finish the post. I would also add that writing an interpreter in this style is far simpler than writing even a simplest JIT, but it does impose certain requirements on the language itself. All in all, the design is nothing short of genius. |