|
|
|
|
|
by DarkPlayer
589 days ago
|
|
I don't think that different algorithms are better for merging or diffing. In both cases, the first step is to match identical nodes, and the quality of the final result depends heavily on this step. The main problem with GumTree is that it is a greedy algorithm. One incorrectly matched node can completely screw up the rest of the matches. A typical example we encountered was adding a decorator to a function in Python. When other functions with the same decorator followed, the algorithm would often map the newly added decorator to an existing decorator, causing all other decorator mappings to be "off-by-one". GumTree has a tendency to come up with more changes than there actually are. We try to really get the diff quality nailed down before going after merges. We don't have merge functionallity in SemanticDiff yet. The main issue we have with tree-sitter is that the grammars are often written from scratch and not based on the upstream grammar definition. Sometimes they only cover the most likely cases which can lead to parsing errors or incorrectly parsed code. When you encounter parsing errors it can be difficult to fix them, because the upstream grammar is structured completely different. To give you an example, try to compare the tree-sitter Go grammar for types [1] with the upstream grammar [2]. It is similar but the way the rules are structured is somewhat inverted. We use separate executables for the parsers (this also helps to secure them using seccomp on Linux), and they all use the same JSON schema for their output. This allows us to write the parser executable in the most appropriate language for the target language. Building all them statically and cross-platform for our VS Code extension isn't easy though ;) [1]: https://github.com/tree-sitter/tree-sitter-go/blob/master/gr...
[2]: https://go.dev/ref/spec#Types |
|
- for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,
- for diffing, it feels more acceptable to restrict the matching to be monotonous on the leaves since it's difficult to visually represent moves if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,
- diffing needs to be faster than merging, so the accuracy/speed tradeoffs can be different.
Packaging parsers into separate executables seems like hard work indeed! I assume you also considered fixing the tree-sitter grammars (vendoring them as needed, if the fixes can't be upstreamed)? Tree-sitter parsers are being used for a lot more than syntax highlighting these days (for instance GitHub's "Symbols" panel) so I would imagine maintainers should be open to making grammars more faithful to the official specs. I'm not particularly looking forward to maintaining dozens of forked grammars but it still feels a lot easier than writing parsers in different languages. I guess you have different distribution constraints also.