Hacker News new | ask | show | jobs
by hyper_reality 1887 days ago
This is an excellent tool to have as a security consultant, and it just keeps getting better and better. When approaching a large codebase, it enables you to write custom rules that match on certain antipatterns you've spotted that may be unique to the codebase. That's the real value of the tool, but the repository of per-language rules is also convenient for quickly finding low-hanging fruit (like every use of a potentially injectable function such as exec,system,etc. in PHP).

For example, a webapp may have been designed such that authorisation needs to be explicitly added with a line or two to each controller. A semgrep rule can be written to match all the controllers which are missing this line. Then these controllers can be manually reviewed to assess whether unauthorised access should be allowed. Depending on what you are trying to match, this is something that may be very complex or even impossible to implement accurately in plain grep. Some languages like Ruby have powerful static analysis tools (Brakeman) that can also do this, but the benefit of Semgrep is the flexibility across multiple languages and how readable the rulesets are. [1]

[1] https://blog.includesecurity.com/2021/01/custom-static-analy...

5 comments

I agree it is an excellent tool. At GitLab we released our Semgrep integration today https://news.ycombinator.com/item?id=26903114

"GitLab SAST historically has been powered by over a dozen open-source static analysis security analyzers. These analyzers have proactively identified millions of vulnerabilities for developers using GitLab every month. Each of these analyzers is language-specific and has different technology approaches to scanning. These differences produce overhead for updating, managing, and maintaining additional features we build on top of these tools, and they create confusion for anyone attempting to debug.

The GitLab Static Analysis team is continuously evaluating new security analyzers. We have been impressed by a relatively new tool from the development team at r2c called Semgrep. It’s a fast, open-source, static analysis tool for finding bugs and enforcing code standards. Semgrep’s rules look like the code you are searching for; this means you can write your own rules without having to understand abstract syntax trees (ASTs) or wrestle with regexes.

Semgrep’s flexible rule syntax is ideal for streamlining GitLab’s Custom Rulesets feature for extending and modifying detection rules, a popular request from GitLab SAST customers. Semgrep also has a growing open-source registry of 1,000+ community rules.

We are in the process of transitioning many of our lint-based SAST analyzers to Semgrep. This transition will help increase stability, performance, rule coverage, and allow GitLab customers access to Semgrep’s community rules and additional custom ruleset capabilities that we will be adding in the future. We have enjoyed working with the r2c team and we cannot wait to transition more of our analyzers to Semgrep. You can read more in our transition epic, or try out our first experimental Semgrep analyzers for JavaScript, TypeScript, and Python.

We are excited about what this transition means for the future of GitLab SAST and the larger Semgrep community. GitLab will be contributing to the Semgrep open-source project including additional rules to ensure coverage matches or exceeds our existing analyzers."

I'd be careful with how much of a warm fuzzy the tool gives you. See this example from my other comment in the thread: https://news.ycombinator.com/item?id=26905880 If it were really looking at AST level data, that wouldn't have fooled it.

I suspect there would be similar issues with your example of ensuring no use of eval() in PHP. So it seems okay to keep your own developers informed, but I wouldn't use it, alone, to vet outside code. PHP has eval-like functionality buried in preg_replace(), assert(), and probably other places. This tool also doesn't seem to dig into namespaced "aliases".

> If it were really looking at AST level data, that wouldn't have fooled it.

Semgrep does look at an AST; but that counterexample is not something you can "fix" solely by looking at an AST. You need actual Python-specific semantic analysis that knows that all "open" functions like 'print' come from the builtins module, and thus are bound to the same identifier. They're literally built into the implementation, it's not something you can "discover" from analyzing existing Python source. Even if you had a perfectly accurate python AST it couldn't "tell" you this fact, it's a priori knowledge, and all analysis engines need a base set of facts like this that they work from.

> but I wouldn't use it, alone, to vet outside code.

I mean, nobody seems to be suggesting this though, and the OP quite literally stated the major value of the tool is enforcing domain/codebase-specific rules among a team. Which is a really good use for it! There are tons of little useful patterns you can codify this way.

>Semgrep does look at an AST; but that counterexample is not something you can "fix" solely by looking at an AST

Perhaps I worded it poorly. Dumping the python AST for builtins.print() makes it pretty clear that it's "print" though. So I'm curious why that skirts the rule.

>I mean, nobody seems to be suggesting this though

Not specificially, but the context is using it for security purposes with phrases like "every use of a potentially injectable function such as exec,system,etc. in PHP". Felt like that was worth commenting on.

> Dumping the python AST for builtins.print() makes it pretty clear that it's "print" though.

No, the AST only tells you it's a method call on something called "builtins." You need the separate semantic knowledge of what builtins is in order to figure it out. Parsing + AST just means it sees "method call of `print` on `builtins` object". Regular print calls would come through as "regular function call of `print`".

> Perhaps I worded it poorly. Dumping the python AST for builtins.print() makes it pretty clear that it's "print" though. So I'm curious why that skirts the rule.

To echo the other replies, the AST for builtins.print() is the same as the ast for mymodule.print() and, in fact, if you stick a builtins.py in the right place, you'll be able to prevent the import of the standard library builtin module, while the ast's would be identical.

I'm honestly surprised we haven't seen complete democratization of linting/grepping/refactoring tools. Imagine being able to script a one-off refactor you want to make that's too project-specific to be included in the IDE itself.

Writing the parser is nontrivial, but once you have it it should be straightforward to expose a programmatic API for doing this stuff instead of trying to hardcode every useful linter rule into a single program.

We do this with rubocop and its ability to parse the ruby AST. The API to actually write rules in rubocop is not particularly the best, but it has been highly successful in codebase and domain-specific ruby linting and autofixing.
You can do this with Rust using a combination of rustc, syn and a build.rs file. Then you can execute more Rust code on the parsed AST.

In the JVM world annotation processors or compiler plugins can have access to the AST during compilation. Both in Kotlin and Java.

For C/C++ code, you can already do refactoring using clang-tidy scripts [0], or even can write custom linters using libtooling [1] and leverage the AST Matchers [2] which work at the AST level.

All that's needed is a compile_commands.json file which can be easily generated via most build systems, or you can use Bear [3]/some other tool (or write a script that logs all syscalls and generate it yourself).

[0] https://releases.llvm.org/12.0.0/tools/clang/tools/extra/doc...

[1] https://releases.llvm.org/12.0.0/tools/clang/docs/LibTooling...

[2] https://releases.llvm.org/12.0.0/tools/clang/docs/LibASTMatc...

[3] https://github.com/rizsotto/Bear

I wrote a small VS code extension and pre-commit hook that might meet 80% of your needs:

https://github.com/elanning/checkr

It is just simple regex at this time, but hopefully I can add something like CCGrep syntax in the future:

https://github.com/yuy-m/CCGrep

Writing a parser for one specific version of a language is one thing (nontrivial), but writing a parser than parsed most versions of most programming languages and keeping it up-to-date is a enormous undertaking and cannot be done by one person
Isn't a problem here that, since there's no simple definition of what it's doing, there's no simple way to assess your false negative rate when searching for something? That is fine for many pragmatic informal uses but doesn't seem a good fit for security purposes.
Python you can probably do that just from the standard library. ast.parse and all that.
Perhaps, but this tool has a syntax and common interface for multiple languages. This is a huge resource for IT/security. I could imagine this being a helpful refactoring tool as well for CLI.
No need to overstate. It's fine, but the rules seem to need to be very bespoke, and any language's parser will do a better job of figuring out syntax.