by k4st 1330 days ago
At Trail of Bits, we are creating a new compiler front/middle end for Clang called VAST [1]. It consumes Clang ASTs and creates a high-level, information-rich MLIR dialect. Then, we progressively lower it through various other dialects, eventually down to the LLVM dialect in MLIR, which can be translated directly to LLVM IR.

Our goals with this pipeline are to enable static analyses that can choose the right abstraction level(s) for their needs and, using provenance, cross abstraction levels to relate results back to source code.

Neither Clang ASTs nor LLVM IR alone meet our needs for static analysis. Clang ASTs are too verbose and lack explicit representations for implicit behaviours in C++. LLVM IR isn't really "one IR"; it's two IRs (LLVM proper and metadata), where LLVM proper is an unspecified family of dialects (-O0, -O1, -O2, -O3, then all the arch-specific stuff). LLVM IR also isn't easy to relate to source, even in the presence of maximal debug information. The Clang codegen process does ABI-specific lowering, which takes high-level types/values and transforms them to be more amenable to storing in target-CPU locations (e.g. registers). This actively works against relating information across levels, something that we want to solve with intermediate MLIR dialects.

Beyond our static analysis goals, I think an MLIR-based setup will be a key enabler of library-aware compiler optimizations. Right now, library-aware optimizations are challenging because Clang ASTs are hard to mutate, and by the time things are in LLVM IR, the abstraction boundaries provided by libraries are broken down by optimizations (e.g. inlining, specialization, folding), forcing optimization passes to reckon with the mechanics of how libraries are implemented.

We're very excited about MLIR, and we're pushing full steam ahead with VAST. MLIR is a technology that we can use to fix a lot of issues in Clang/LLVM that hinder really good static analysis.

[1] https://github.com/trailofbits/vast

3 comments

How is this different from Facebook's CIR?

https://www.phoronix.com/news/Meta-Developing-Clang-IR-CIR

I'm also curious, especially since they seem to be gearing towards analysis as well.

https://github.com/llvm/clangir/blob/main/clang/lib/CIR/Dial...

These are great questions! We think our approaches are complementary. We think that CIR, or ClangIR, can be a codegen target from a combination of our high-level IR and our medium-level IR dialects.

Our understanding of ClangIR is that it has side-stepped the problem of trying to relate/map high-level values/types to low-level values/types -- a process which the Clang codegen brings about when generating LLVM IR. We care about explicitly representing this mapping so that there is data flow from a low-level (e.g. LLVM) representation all the way back up to a high-level one. There's a lot of value and implied semantics in high-level representations that are lost to Clang's codegen, and thus to the Clang IR codegen. The distinction between `time_t` and `int` is an example of this. We would like to be able to see an `i32` in LLVM and follow it back to a `time_t` in our high-level dialect. This is not a problem that ClangIR sets out to solve. Thus, ClangIR is too low level to achieve some of our goals, but it is also at the right level to achieve some of our other goals.
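
To make the `i32` to `time_t` idea concrete, here is a toy sketch in Python. It is not VAST's API or data model: the `Value` class, the `lowered_from` field, the `lower` and `provenance` helpers, and the dialect labels are all hypothetical names made up for illustration. It also assumes a target where `time_t` ends up as a 32-bit integer, as in the example above. The only point is that each lowered value keeps a link to the value it came from, so a low-level value can be followed back up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Value:
    """A value at one abstraction level, with an optional link to its origin."""
    name: str
    type_name: str
    dialect: str
    lowered_from: Optional["Value"] = None  # hypothetical provenance link


def lower(value: Value, type_name: str, dialect: str) -> Value:
    """Create the lower-level counterpart of `value`, remembering its origin."""
    return Value(value.name, type_name, dialect, lowered_from=value)


def provenance(value: Value) -> list[str]:
    """Walk the provenance links from `value` up to the highest level."""
    chain = []
    current: Optional[Value] = value
    while current is not None:
        chain.append(f"{current.dialect}:{current.type_name}")
        current = current.lowered_from
    return chain


# Assumption: time_t is lowered to a 32-bit integer on this imaginary target.
high = Value("deadline", "time_t", "high-level")
mid = lower(high, "int", "mid-level")
low = lower(mid, "i32", "llvm")

print(" -> ".join(provenance(low)))
```

Without the `lowered_from` link, the `llvm`-level value would carry only `i32`, and the fact that it started life as a `time_t` would be unrecoverable.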

> LLVM dialect in MLIR, which can be translated directly to MLIR

Should be:

LLVM dialect in MLIR, which can be translated directly to LLVM IR

Otherwise, great project! We're also using MLIR internally and it's been awesome, game-changing even, considering how much can be accomplished with a reasonable amount of effort.

Typo fixed! Thanks :-)

I think the next big problems for MLIR to address are things like: metadata/location maintenance when integrating with third-party dialects and transformations. With LLVM optimizations, getting the optimization right has always seemed like the top priority, and then maybe getting metadata propagation working came a distant second.

I think the opportunity with MLIR is that metadata/location info can be the old nodes or other dialects. In our work, we want a tower/progression of IRs, and we want them simultaneously in memory, all living together. You could think of the debug metadata for a lower level dialect being the higher level dialect. This is why I sometimes think about LLVM IR as really being two IRs: LLVM "code" and metadata nodes. Metadata nodes in LLVM IR can represent arbitrary structures, but lack concrete checks/balances. MLIR fixes this by unifying the representations, bringing in structure while retaining flexibility.

Is reusing tool names in the same sub-field of Computer Science becoming a thing? VAST was a Fortran preprocessor.
Good timing: just got an invite to attend VAST's breakfast discussion at the SC (supercomputing) conference.

HN: two net downvotes for asking a question.