| HN Mirror


	by dobin 888 days ago
	I once had the idea to do malware-similarity analysis. The x86 code should first be lifted into an IL, so it gets "normalized" (e.g. register-independent). The problem with all lifters is though that even a trivial "add rax, 1" generated a lot of IL code (probably 50-100 lines in LLVM IL), as the lifter had to implement all side effects of the x86 instruction in a fake memory space (I used remill, if I remember correctly). Does this lifter have a similar implementation, or will an "add rax, 1" be lifted to something like "register1 += 1"?

2 comments

aengelke 888 days ago

> The problem with all lifters is though that even a trivial "add rax, 1" generated a lot of IL code (probably 50-100 lines in LLVM IL)

Why is this a problem? The addition is one LLVM-IR instruction (add), followed by flag computation (maybe 10-20 instrs). Dead code elimination will afterwards quickly remove unused instructions (e.g., unused flags).
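
To make the dead-flag point concrete, here is a minimal sketch in Python over a toy IL. It is not LLVM-IR and not the output of any real lifter; the op names (`carry_of_add` etc.) and both helper functions are made up for illustration. It lifts `add rax, 1` into one add plus flag computations, then runs a backward liveness pass that drops every flag nobody reads.

```python
# Toy IL instruction: (dest, op, operands). Names are hypothetical, not LLVM-IR.
def lift_add(dst, src, imm):
    """Lift `add reg, imm` into the value computation plus flag computations."""
    return [
        (dst, "add", (src, imm)),
        ("zf", "is_zero", (dst,)),
        ("sf", "is_negative", (dst,)),
        ("cf", "carry_of_add", (src, imm)),
        ("of", "overflow_of_add", (src, imm)),
        ("pf", "parity", (dst,)),
    ]


def eliminate_dead_code(instrs, live_out):
    """Walk backwards and keep only instructions whose result is still needed."""
    live = set(live_out)
    kept = []
    for dest, op, operands in reversed(instrs):
        if dest in live:
            live.discard(dest)
            # Named operands become live; integer immediates are ignored.
            live.update(x for x in operands if isinstance(x, str))
            kept.append((dest, op, operands))
    kept.reverse()
    return kept


lifted = lift_add("rax1", "rax0", 1)           # SSA-style names: rax0 -> rax1
only_value = eliminate_dead_code(lifted, {"rax1"})
with_zf = eliminate_dead_code(lifted, {"rax1", "zf"})
print(len(lifted), len(only_value), len(with_zf))
```

If only the new value of `rax` is used afterwards, the single add survives; if a later branch reads the zero flag, that one flag computation is kept as well.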

> register1 += 1

I don't see how this could be beneficial, especially on x86 where you can have "mov rax, rdx; add rax, 1" and "lea rax, [rdx + 1]", which do mostly the same thing (only the former clobbers flags). SSA removes registers and shows the semantic operations clearly.
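
A rough illustration of why a value-based view helps, as a Python sketch: a tiny symbolic evaluator for a three-instruction x86 subset. The tuple encoding of instructions and the `run` helper are assumptions made for this example, not how any real lifter represents code. Both sequences end up with the same expression for `rax`; only the first one defines flags.

```python
def run(instrs):
    """Symbolically execute a tiny x86 subset; registers map to expression trees."""
    regs = {}

    def val(x):
        # Immediates stay as ints; unwritten registers stand for their initial value.
        if isinstance(x, int):
            return x
        return regs.get(x, x)

    for ins in instrs:
        op = ins[0]
        if op == "mov":                      # mov dst, src
            _, dst, src = ins
            regs[dst] = val(src)
        elif op == "add":                    # add dst, src (also writes flags)
            _, dst, src = ins
            regs[dst] = ("add", val(dst), val(src))
            regs["flags"] = ("flags_of", regs[dst])
        elif op == "lea":                    # lea dst, [base + disp] (no flags)
            _, dst, base, disp = ins
            regs[dst] = ("add", val(base), disp)
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs


a = run([("mov", "rax", "rdx"), ("add", "rax", 1)])
b = run([("lea", "rax", "rdx", 1)])
print(a["rax"] == b["rax"])        # same value expression for rax
print("flags" in a, "flags" in b)  # only the mov/add sequence defines flags
```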


aleclm 888 days ago

I had some ideas about binary diffing, but it's a difficult topic and I'm too much of a noob in ML to get to something working in a decent time frame.

I think something ABI-, compiler- and architecture-agnostic would be super cool and I started to build a training data set.

I wouldn't diff individual instructions though, I'd go for something more high-level, such as features of the CFG and the types of operations in the nodes.
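
As a minimal sketch of what such features could look like (in Python; the CFG encoding, the operation categories, and the `cfg_features` helper are all assumptions for illustration, not an existing tool's format): each basic block carries a list of coarse operation types and its successors, and the function is summarized by a few graph counts plus an operation histogram.

```python
from collections import Counter


def cfg_features(cfg):
    """Summarize a CFG given as {block: (operation_types, successors)}."""
    nodes = len(cfg)
    edges = sum(len(succs) for _, succs in cfg.values())
    ops = Counter(op for block_ops, _ in cfg.values() for op in block_ops)
    return {
        "nodes": nodes,
        "edges": edges,
        # Cyclomatic complexity E - N + 2, assuming one connected component.
        "cyclomatic": edges - nodes + 2,
        "ops": dict(ops),
    }


# Hypothetical function with a single loop.
example = {
    "entry": (["mov"], ["loop"]),
    "loop": (["arith", "cmp", "branch"], ["loop", "exit"]),
    "exit": (["ret"], []),
}
print(cfg_features(example))
```

Two functions compiled for different architectures would produce different instructions, but could still land on similar counts here, which is the kind of signal a diffing model might be trained on.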


westurner 878 days ago

Ghidriff: Ghidra Binary Diffing Engine, ghidra-patchdiff-correlator: https://news.ycombinator.com/item?id=38870593
