
by boricj 642 days ago

> You can't use individual functions without knowing the data structures they operate on, and info about data structures is usually not present in the final binary.

It turns out you can. Linkers do not care about types or data structures; all they do is lay out sections (a fancy name for arrays of bytes) inside a virtual address space and fix up relocations.
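To make "lay out sections and fix up relocations" concrete, here is a minimal sketch of a toy linker. It is an illustration only, not how any real linker or the tooling mentioned in this thread is implemented: the section contents, the `counter` symbol, the tuple formats and the `link` function are all made up for the example, and it only handles 32-bit absolute relocations.

```python
import struct

# Hypothetical toy model: a section is a named array of bytes.
sections = {
    ".text": bytearray(b"\xA1\x00\x00\x00\x00\xC3"),  # x86: mov eax, [abs32]; ret
    ".data": bytearray(struct.pack("<I", 42)),         # a 4-byte variable
}

# symbol name -> (section, offset inside that section)
symbols = {"counter": (".data", 0)}

# (section to patch, offset to patch, target symbol, addend)
relocations = [(".text", 1, "counter", 0)]


def link(base):
    """Lay out the sections starting at `base`, then patch every relocation."""
    # Step 1: place the sections one after another in the address space.
    addresses = {}
    cursor = base
    for name, data in sections.items():
        addresses[name] = cursor
        cursor += len(data)

    # Step 2: write the final address of each target symbol into the bytes
    # that refer to it. No type information is involved at any point.
    image = {name: bytearray(data) for name, data in sections.items()}
    for section, offset, symbol, addend in relocations:
        target_section, target_offset = symbols[symbol]
        value = addresses[target_section] + target_offset + addend
        struct.pack_into("<I", image[section], offset, value & 0xFFFFFFFF)

    return addresses, b"".join(image[name] for name in sections)


addrs_a, image_a = link(0x00400000)
addrs_b, image_b = link(0x10000000)
```

Linking the same sections at two base addresses changes only the four patched bytes in `.text`; everything else is byte-for-byte identical. That is the property that lets relocatable code be moved without understanding what it does.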

I've written case studies on my blog where I've successfully carved whole pieces out of a program and reused them without reverse-engineering them. I've even made a native port of a ~100 KiB Linux a.out x86 proprietary program from 1995 to Windows and all I had to do was write 50 lines of C to thunk between the glibc 1.xx-flavored object file and the MinGW runtime.

One user of my tooling managed to delink a 7 MiB executable for a 2009 commercial Windows video game in six weeks (including the time required to fix all the bugs in my alpha-quality COFF exporter and x86 analyzer), leaving out the C standard library. They then relinked it at a different base address and the new executable works so well it's functionally indistinguishable from the original one. They didn't reverse-engineer the thousands of functions or the hundreds of kilobytes of data inside that program to do so.

This is complete heresy according to conventional computer science, which is why you can't apply it here. I'd be happy to talk at length about it, but without literature on the topic I'm forced to explain this from basic principles every time and Hacker News isn't the place to write a whole thesis.

1 comment

karpierz 642 days ago

The question on my mind: how do you figure out what the functions do without reverse engineering?

If I were to guess, you're saying that you reverse engineer the API boundary without reverse engineering the implementation. But then figuring out what the API contract is without documentation seems intractable for most API boundaries.

boricj 642 days ago

For context, my tooling is a Ghidra extension, so there's all the usual Ghidra stuff that applies here.

Indeed, it depends on what the API boundary is for the selection to be exported.

If it's the whole program without some well-known libraries (like the C runtime library for a statically linked executable or the Psy-Q SDK for a PlayStation video game), then the API boundary is trivial in the sense that it's a documented one. The hard part is actually figuring out where those libraries are so that you can cut them out of your selection while exporting.

If it's a particular subset internal to the program then it's trickier because you don't have that (but if you know you want to export it, then you must already know something about it). Traditional reverse-engineering techniques do apply here; it's just that you only care about the ABI boundary instead of the whole implementation, so it's usually less work to figure out.

However, if you get it wrong then the toolchain will usually not detect it and you'll have some very exotic undefined behavior on your hands when you try to run your Frankensteinized program. Troubleshooting these issues can be very tricky because my tooling doesn't generate debugging symbols, so the debugging experience is atrocious.

I've always managed to muddle through so far, but one really nasty case did take me a couple of weekends to track down. The lesson: don't cut across a variable when you're exporting, because you'll truncate it. Among other things, that can corrupt whatever was laid out next in memory at runtime when the original delinked code tries to write to it.
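As an illustration of that last pitfall, here is a small sketch of a sanity check that flags variables straddling the edge of an export selection. It is not a feature of the tooling discussed above; the `straddling` helper, the `(name, start, size)` tuples and the addresses are all hypothetical.

```python
def straddling(variables, sel_start, sel_end):
    """Return the names of variables only partly inside [sel_start, sel_end)."""
    cut = []
    for name, start, size in variables:
        end = start + size
        overlaps = start < sel_end and sel_start < end
        contained = sel_start <= start and end <= sel_end
        if overlaps and not contained:
            cut.append(name)
    return cut


# Hypothetical data layout: a 16-byte table followed directly by a 4-byte flag.
variables = [("table", 0x1000, 16), ("flags", 0x1010, 4)]

# Ending the selection at 0x1008 would export only the first half of `table`.
bad_cut = straddling(variables, 0x1000, 0x1008)

# Ending it at 0x1010 keeps `table` whole and leaves `flags` out entirely.
clean_cut = straddling(variables, 0x1000, 0x1010)
```

In the bad case the exported object would reserve 8 bytes for `table` while the delinked code still writes all 16, so the extra 8 bytes land on whatever the linker happens to place next.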