by jcranmer
811 days ago
One of the blog posts I keep meaning to write but never quite get around to is a post arguing that C is not portable assembly. What is necessary is decompilation to a portable C-like assembly, but that target is not C, and I think focusing on creating valid C tends to drag you towards suboptimal decisions, even leaving aside issues like "should SLL decompile to x << y or x << (y % 32)?"
In my experience with Ghidra, I've seen far too many cases where Ghidra starts with the wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether. There are some cases where it's clear it's just poor analysis on Ghidra's part (e.g., it doesn't seem to understand stack slot reuse, and memcpy-via-xmm is very confusing to it). And Ghidra's type system lacks function pointer types, which is very annoying when you're doing vtable-heavy C++ code.
I do like the appeal of a recompilable target language. But that language need not be C--in fact, I'm actually sketching out the design of such a language for my own purposes, so that I can read LLVM IR without going crazy (which means I need to distinguish between, e.g., add nuw and just plain add).
Analysis necessarily involves multiple levels. Given that a lot of the type analysis today tends to be crap, I'd prefer to be able to see a more solid first-level analysis that does variable recovery and works out function calling conventions, so that it can inform my ability to reverse engineer structures or answer things like "does this C++ method return a non-trivial struct that is an implicit first parameter?" (Also, since I'm largely looking at C++ code in practice, I'd absolutely love to be able to import C++ header files to fill in known structure types.)
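To illustrate the "implicit first parameter" question, here is a minimal sketch in plain C. It assumes an ABI such as System V x86-64, where a struct this large is returned through a caller-provided buffer whose address is passed as a hidden argument. The names `Big`, `make_big`, and `make_big_lowered` are hypothetical; the second function is roughly the shape a first-level analysis would see in the binary, and recognizing it as the first is the recovery step in question.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical struct, large enough to be returned in memory on
 * common ABIs (e.g., System V x86-64) rather than in registers. */
struct Big {
    char name[64];
    int id;
};

/* What the source code says: the function returns a struct by value. */
struct Big make_big(int id) {
    struct Big b;
    memset(&b, 0, sizeof b);
    snprintf(b.name, sizeof b.name, "obj-%d", id);
    b.id = id;
    return b;
}

/* Roughly what the machine code does: the caller allocates the result
 * and passes its address as a hidden parameter ahead of the visible one. */
void make_big_lowered(struct Big *ret, int id) {
    memset(ret, 0, sizeof *ret);
    snprintf(ret->name, sizeof ret->name, "obj-%d", id);
    ret->id = id;
}
```

Both functions produce the same object; only the second makes the calling convention visible. For C++ methods, the position of the hidden pointer relative to `this` depends on the ABI, which is part of why this is worth recovering before type analysis.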
I think this is a bit of a misguided question. The hardware usually has precisely defined semantics. QEMU's << behaves similarly to C (undefined behavior for rhs >= 32), but this means that the lifter (still QEMU) will account for this and emit code preserving the semantics.
tl;dr: the code we emit should do the right thing depending on what the original instruction did, without making assumptions about what happens in cases of C undefined behavior.
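As a minimal sketch of what "preserving the semantics" can look like in emitted C, assuming 32-bit registers: the helper names below are hypothetical and not taken from any particular tool. The first models instructions that only look at the low 5 bits of the shift amount (e.g., x86 SHL on 32-bit operands, or SLL on RV32); the second models a shift where amounts of 32 or more yield zero (as with ARM's register-specified LSL, which uses the bottom byte of the amount). Neither ever reaches the C undefined behavior for shift counts >= 32.

```c
#include <stdint.h>

/* Hypothetical helper: the instruction uses only the low 5 bits of the
 * shift amount, so the mask is applied explicitly before shifting. */
uint32_t sll_masked(uint32_t x, uint32_t y) {
    return x << (y & 31u);
}

/* Hypothetical helper: the instruction uses the bottom byte of the
 * shift amount, and any amount of 32 or more produces zero. */
uint32_t lsl_zero_fill(uint32_t x, uint32_t y) {
    uint32_t amount = y & 0xffu;
    return amount >= 32u ? 0u : x << amount;
}
```

Which helper is correct depends entirely on the source instruction, which is the point: the choice between `x << y` and `x << (y % 32)` is answered by the ISA, not by the decompiler's taste.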
> Ghidra's type system lacks function pointer types
Weird limitation, we support those.
> it doesn't seem to understand stack slot reuse
That's a tricky one. We're now redesigning certain parts of the pipeline to enable LLVM to promote stack accesses to SSA values, which basically solves the stack slot reuse problem. This is probably one of the most important features experienced reversers ask for.
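A minimal sketch of the stack slot reuse problem, written in C for illustration only; it is not the output of any real decompiler, and both function names are hypothetical. The first function mimics what a naive lift sees: one 8-byte stack slot holding an integer during its first lifetime and a double during its second. The second shows the same computation once each lifetime has been promoted to its own value.

```c
#include <stdint.h>
#include <string.h>

/* Naive view: a single raw stack slot reused for two unrelated values.
 * Assigning the slot one type makes one of the two uses look like gibberish. */
double reuse_raw(int32_t n) {
    unsigned char slot[8];

    int32_t doubled = n * 2;
    memcpy(slot, &doubled, sizeof doubled);   /* first lifetime: int32 */

    int32_t reloaded;
    memcpy(&reloaded, slot, sizeof reloaded);

    double shifted = reloaded + 0.5;
    memcpy(slot, &shifted, sizeof shifted);   /* second lifetime: double */

    double out;
    memcpy(&out, slot, sizeof out);
    return out;
}

/* After promotion: each lifetime is a separate, correctly typed value. */
double reuse_promoted(int32_t n) {
    int32_t doubled = n * 2;
    double shifted = doubled + 0.5;
    return shifted;
}
```

The two functions compute the same result, but only the second gives type analysis something coherent to work with.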
> that language need not be C--
Making up your own language is a temptation one should resist.
Anyway, we're rewriting our backend using an MLIR dialect (we call it clift) which targets C but should be good enough to emit something "similar to C but slightly different". It might make sense to have a different backend there. But a "standard C" backend has to be the first use case.
We thought about emitting C++; it would make our life simpler. But I think targeting non-C as the first and foremost backend would be a mistake.
Also, a Python backend would be cool.
> Analysis necessarily involves...
I would be interested in discussing in more detail what exactly you mean here. Why don't you join our Discord server?
> I'd absolutely love to be able to import C++ header files to fill in known structure types
We have a project for importing from header files. Basically, we want to use a compiler to turn them into DWARF debug symbols and then import those. Not too hard.