| One of the main benefits of Semgrep is its unified DSL, which works across all supported languages. In contrast, using the Go module "smacker/go-tree-sitter" exposes you to differences in S-expression output caused by variations and changes in the independently maintained grammars. I've seen grammars that are part of "smacker/go-tree-sitter" change their syntax between versions, which can break existing S-expressions. Semgrep solves that with its DSL, which also abstracts away those kinds of grammar changes. I'm also a bit concerned that tree-sitter S-expressions can become "write-only", because they rely on the reader also understanding the grammar they were written against. For example, here's a Semgrep rule for detecting a Jinja2 environment with autoescaping disabled:
rules:
  - id: incorrect-autoescape-disabled
    patterns:
      - pattern: jinja2.Environment(... , autoescape=$VAL, ...)
      - pattern-not: jinja2.Environment(... , autoescape=True, ...)
      - pattern-not: jinja2.Environment(... , autoescape=jinja2.select_autoescape(...), ...)
      - focus-metavariable: $VAL
Now, compare it to the corresponding tree-sitter S-expression (generated by o3-mini-high):
(
  call
    function: (attribute
      object: (identifier) @module (#eq? @module "jinja2")
      attribute: (identifier) @func (#eq? @func "Environment"))
    arguments: (argument_list
      (_)*
      (keyword_argument
        name: (identifier) @key (#eq? @key "autoescape")
        value: (_) @val
        (#not-match @val "^True$")
        (#not-match @val "^jinja2\\.select_autoescape\\("))
      (_)*)
) @incorrect_autoescape
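To make concrete what both rules are trying to flag, here is a rough sketch of the same check using only Python's standard `ast` module. It is neither Semgrep nor tree-sitter. The function name `find_unsafe_autoescape` and the sample snippet are made up for illustration, and the sketch assumes Python 3.9+ (for `ast.unparse`) and only recognises the literal `jinja2.Environment` spelling, so aliased imports are not handled.

```python
import ast


def find_unsafe_autoescape(source: str) -> list[int]:
    """Return sorted line numbers of jinja2.Environment(...) calls whose
    autoescape keyword is neither True nor jinja2.select_autoescape(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Only the literal dotted name is matched (illustrative simplification).
        if ast.unparse(node.func) != "jinja2.Environment":
            continue
        for kw in node.keywords:
            if kw.arg != "autoescape":
                continue
            value = kw.value
            is_true = isinstance(value, ast.Constant) and value.value is True
            is_select = (
                isinstance(value, ast.Call)
                and ast.unparse(value.func) == "jinja2.select_autoescape"
            )
            if not (is_true or is_select):
                hits.append(node.lineno)
    return sorted(hits)


# Hypothetical target code: lines 2 and 5 should be flagged, lines 3 and 4 should not.
SAMPLE = '''\
import jinja2
a = jinja2.Environment(autoescape=False)
b = jinja2.Environment(autoescape=True)
c = jinja2.Environment(loader=None, autoescape=jinja2.select_autoescape(["html"]))
d = jinja2.Environment(autoescape=flag)
'''

print(find_unsafe_autoescape(SAMPLE))  # -> [2, 5]
```

The three branches correspond to the `pattern` and the two `pattern-not` clauses of the Semgrep rule. A call that omits `autoescape` entirely is not reported, which matches the rule as written above.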
People can disagree, but I'm not sure that tree-sitter S-expressions are an upgrade over a DSL. I'm hoping I'm proven wrong ;-) |
Another benefit of a DSL like Semgrep's is that LLMs have become very good at generating it. See https://github.com/lambdasec/autogrep for how to automatically generate Semgrep rules from existing CVEs.