by danielvaughn 845 days ago
I'm not super familiar with the space, but tree-sitter seems to take an interesting approach in that it is an incremental parser. So instead of re-parsing the entire document on every change, it re-parses only the affected text, which makes it much more efficient for text editors.

I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.
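To make the "only re-parse the affected text" idea concrete, here is a toy sketch. It is not tree-sitter's algorithm (tree-sitter reuses unchanged subtrees of a real syntax tree); it assumes a hypothetical line-oriented `let` language where every line parses independently, and `Document`, `Stmt`, and `parse_line` are made-up names for illustration.

```rust
// Toy model of incremental parsing: one cached parse result per line.
// Assumption: each line is an independent `let <name> = <value>;` statement.
#[derive(Debug, Clone, PartialEq)]
struct Stmt {
    name: String,
    value: String,
}

// Parse a single line; returns None if it is not a `let` statement.
fn parse_line(line: &str) -> Option<Stmt> {
    let rest = line.trim().strip_prefix("let ")?;
    let rest = rest.strip_suffix(';')?;
    let (name, value) = rest.split_once('=')?;
    Some(Stmt {
        name: name.trim().to_string(),
        value: value.trim().to_string(),
    })
}

struct Document {
    lines: Vec<String>,
    parsed: Vec<Option<Stmt>>,
    parse_calls: usize, // counts how much parsing work has been done
}

impl Document {
    // Initial load: every line has to be parsed once.
    fn new(text: &str) -> Self {
        let lines: Vec<String> = text.lines().map(String::from).collect();
        let parsed: Vec<Option<Stmt>> = lines.iter().map(|l| parse_line(l)).collect();
        let parse_calls = lines.len();
        Document { lines, parsed, parse_calls }
    }

    // Edit: only the changed line is re-parsed; all other results are reused.
    fn replace_line(&mut self, index: usize, new_text: &str) {
        self.lines[index] = new_text.to_string();
        self.parsed[index] = parse_line(new_text);
        self.parse_calls += 1;
    }
}
```

After loading an N-line document, a single-line edit costs one more `parse_line` call rather than N. Real grammars are harder because an edit can change how the surrounding text parses, which is the part an incremental parser has to handle.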


In my experience incremental parsing doesn't really make much sense. Non-incremental parsing can easily parse huge documents in milliseconds.

Also, Tree-sitter only does half the parsing job: you get a tree of nodes, but you have to do your own parse of that tree to get useful structures out.

I prefer Chumsky or Nom which go all the way.

Ah interesting, yeah I did spend quite a bit of time parsing their AST, which turned out to be harder than writing the grammar itself. I’ll look into those two projects.
What do you mean by “parse of that tree to get useful structures out”? Can you provide some concrete examples?
Yeah, suppose you write a simple config language like:

  let a = 12;
  let b = a + 5;
  ...

Tree-Sitter will give you a tree like

   Node(name="file", range=..., children=[
     Node(name="let_item", range=..., children=[
       Node(name="identifier", range=...)
       Node(name="expression", range=..., children=[
         Node(name="integer_literal", range=...)
   ...
Whereas Nom/Chumsky will give you:

    struct File {
      let_items: Vec<LetItem>,
      // ...
    }
    struct LetItem {
      name: String,
      expression: Expression,
    }
    ...
Essentially Tree-sitter's output is untyped and ad hoc, whereas Nom/Chumsky's is fully validated and statically typed.

In some cases Tree-sitter's output is totally fine (e.g. for syntax highlighting, or rough code intelligence). But if you're going to want to do stuff with the data, like actually process/compile it or provide 100% accurate code intelligence, then I think Nom/Chumsky make more sense.

The downsides of Nom/Chumsky are: pretty advanced Rust with lots of generics (error messages can be quite something!), and keeping track of source code spans (where did the `LetItem` come from) can be a bit of a pain, whereas Tree-sitter does that automatically.
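For anyone wondering what that "second parse" looks like in practice, here is a rough sketch. The `Node` type is a hypothetical stand-in with the same shape as the tree above, not the real tree-sitter API, and the typed `LetItem` is simplified to hold only an integer literal. All names are assumptions for illustration.

```rust
use std::ops::Range;

// Hypothetical generic parse-tree node (NOT tree-sitter's Node type).
#[derive(Debug)]
struct Node {
    name: &'static str,
    range: Range<usize>, // byte span into the source text
    children: Vec<Node>,
}

// Small constructor helper to keep tree literals readable.
fn node(name: &'static str, range: Range<usize>, children: Vec<Node>) -> Node {
    Node { name, range, children }
}

// Typed result; the span is carried along so later errors can point at the source.
#[derive(Debug, PartialEq)]
struct LetItem {
    name: String,
    value: i64,
    span: Range<usize>,
}

#[derive(Debug, PartialEq)]
enum LowerError {
    UnexpectedNode { expected: &'static str, found: String },
    MissingChild(&'static str),
    BadInteger(String),
}

// Check that a node has the kind we expect.
fn check_kind(n: &Node, expected: &'static str) -> Result<(), LowerError> {
    if n.name == expected {
        Ok(())
    } else {
        Err(LowerError::UnexpectedNode { expected, found: n.name.to_string() })
    }
}

// Fetch the child at `index` and validate its kind.
fn child<'a>(parent: &'a Node, index: usize, expected: &'static str) -> Result<&'a Node, LowerError> {
    let c = parent.children.get(index).ok_or(LowerError::MissingChild(expected))?;
    check_kind(c, expected)?;
    Ok(c)
}

// Turn one untyped `let_item` node into a typed LetItem.
// Assumes node ranges are valid byte ranges into `src` (slicing panics otherwise).
fn lower_let(src: &str, n: &Node) -> Result<LetItem, LowerError> {
    check_kind(n, "let_item")?;
    let ident = child(n, 0, "identifier")?;
    let expr = child(n, 1, "expression")?;
    let lit = child(expr, 0, "integer_literal")?;
    let raw = &src[lit.range.clone()];
    let value = raw
        .parse::<i64>()
        .map_err(|_| LowerError::BadInteger(raw.to_string()))?;
    Ok(LetItem {
        name: src[ident.range.clone()].to_string(),
        value,
        span: n.range.clone(),
    })
}

// Walk the root node and lower every child, stopping at the first error.
fn lower_file(src: &str, root: &Node) -> Result<Vec<LetItem>, LowerError> {
    check_kind(root, "file")?;
    root.children.iter().map(|c| lower_let(src, c)).collect()
}
```

Every `check_kind`/`child` call here is validation that a combinator parser producing `LetItem` directly would have done while parsing. The upside of this route is that the spans come for free from the node ranges.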

Ok, understood. I was confused by the phrase "parse of that tree".

Tree-sitter's output is closer to being "dynamic" than "untyped", though.

It's not too hard to build a layer on top of tree-sitter (out of the core lib) to generate statically typed APIs. I haven't felt the need for that yet, but it may be worth exploring.

> actually process/compile it

At work, I built a custom embedded DSL, using tree-sitter for parsing. It has worked well enough so far. The dynamically-typed nature of tree-sitter actually made it easier to port the DSL to multiple runtimes.

> provide 100% accurate code intelligence

Totally agree that tree-sitter cannot be used for this, if we are aiming for 100%.

Not the person you’re asking, but basically anything that needs to happen after the initial parsing stage. So you convert your raw text into an AST, but there’s usually some processing you need to do after that.

Maybe you need to optimize the data, maybe you need to do some error checking. Lots of code is syntactically valid but not semantically valid, and usually those semantic errors will persist into the AST (in my limited experience).
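As a small illustration of "syntactically valid but not semantically valid", here is a sketch of a check that runs after parsing, on the config language from earlier in the thread. The `Expr` shape and the rule that a variable must be bound by an earlier `let` are assumptions made for the example.

```rust
use std::collections::HashSet;

// Assumed typed AST for the toy `let` language.
enum Expr {
    Int(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
}

struct LetItem {
    name: String,
    expression: Expr,
}

#[derive(Debug, PartialEq)]
enum SemanticError {
    UndefinedVariable(String),
}

// Recursively record every variable reference that has no earlier binding.
fn collect_undefined(expr: &Expr, defined: &HashSet<&str>, errors: &mut Vec<SemanticError>) {
    match expr {
        Expr::Int(_) => {}
        Expr::Var(name) => {
            if !defined.contains(name.as_str()) {
                errors.push(SemanticError::UndefinedVariable(name.clone()));
            }
        }
        Expr::Add(lhs, rhs) => {
            collect_undefined(lhs, defined, errors);
            collect_undefined(rhs, defined, errors);
        }
    }
}

// Check items in order; a name becomes visible only after its own `let`.
fn check(items: &[LetItem]) -> Vec<SemanticError> {
    let mut defined: HashSet<&str> = HashSet::new();
    let mut errors = Vec::new();
    for item in items {
        collect_undefined(&item.expression, &defined, &mut errors);
        defined.insert(item.name.as_str());
    }
    errors
}
```

With this, `let b = a + 5;` on its own parses fine but `check` reports `a` as undefined, while the same line preceded by `let a = 12;` passes. No parser, typed or untyped, catches that for you.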

> [incremental parsing] I don't know if that's specific to tree-sitter though

No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).

For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066