Hacker News
by boricj 642 days ago
No, it's actually ripping out bits from an executable and turning them into relocatable object files. The technical term I've come up with for this is delinking, although the decompilation community calls it binary splitting.

Putting it another way, you can make libraries out of a program with this technique.
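
To make the idea concrete, here is a minimal sketch of what delinking does to one chunk of a program, assuming the addresses of symbols and the locations of pointers are already known from prior analysis. The `delink` function, its arguments, and the 32-bit little-endian absolute pointer format are all assumptions made for illustration; this is not how the Ghidra extension is implemented.

```python
import struct

def delink(image, symbols, pointer_offsets):
    """Turn linked bytes back into relocatable bytes plus relocation records.

    image: bytes carved out of the executable
    symbols: name -> (address, size), assumed known from analysis
    pointer_offsets: offsets in `image` that hold 32-bit absolute addresses
    """
    data = bytearray(image)
    relocations = []
    for offset in pointer_offsets:
        (value,) = struct.unpack_from("<I", data, offset)
        for name, (address, size) in symbols.items():
            if address <= value < address + size:
                # Record "symbol + addend" instead of the hard-coded address.
                relocations.append((offset, name, value - address))
                # Blank the field; a linker will fill it in again later.
                struct.pack_into("<I", data, offset, 0)
                break
        else:
            raise ValueError(f"no symbol covers address {value:#x}")
    return bytes(data), relocations

# Hypothetical input: 4 opaque bytes followed by a pointer to counter+2.
image = struct.pack("<II", 0x90909090, 0x1008 + 2)
symbols = {"counter": (0x1008, 4)}
section, relocations = delink(image, symbols, pointer_offsets=[4])
print(relocations)
```

The output bytes no longer depend on where `counter` used to live, which is what makes them reusable as an object file.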

3 comments

Anyone with the patience to read through an existing GitHub repo can do this? If there are valuable bits to be salvaged it's easy enough to copy-paste them into a new library if you so choose.
Well, it's one of those things where you know the rules so well that you can break them, but my Ghidra extension is indeed on GitHub: https://github.com/boricj/ghidra-delinker-extension

Unless you meant copying source code from GitHub, which isn't an option if the program is a 1998 PlayStation video game and you don't have the source code for it.

Yikes. I can't imagine the circumstances where this would be a good option for software that's going to be used beyond a quick one-shot job.
It's very useful if you don't have access to the original source code.

You can do things like decompiling a program piece by piece like the Ship of Theseus. The linker will mend both original and rewritten parts together into a working executable at each step.

If you change the functionality of the rewritten parts, you have a modding framework that's far more malleable than what traditional binary patching allows.

As for quick one-shot jobs, I've created an asset extractor for a PlayStation video game by stuffing its archive code inside a Linux MIPS executable and calling into it, without figuring out the proprietary archive format or how the code inside the delinked object file actually works.

Is that meaningfully different from reverse engineering? You can't use individual functions without knowing the data structures they operate on, and info about data structures is usually not present in the final binary*.

* (excluding languages that compile to an IL like C# of course, but decompiling C# is trivial)

> You can't use individual functions without knowing the data structures they operate on, and info about data structures is usually not present in the final binary.

It turns out you can. Linkers do not care about types or data structures; all they do is lay out sections (a fancy name for arrays of bytes) inside a virtual address space and fix up relocations.
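
As a rough illustration of that claim, here is a toy linker, assuming only one relocation kind (a 32-bit little-endian absolute address) and sections placed back to back. The `link` function and the tuple layout of the relocations are made up for this sketch; real linkers handle many more relocation types, alignment, and symbol resolution.

```python
import struct

def link(sections, relocations, base):
    """Lay out sections from `base` and patch absolute 32-bit relocations.

    sections: name -> bytes (the linker never looks inside them)
    relocations: (section, offset, target_section, addend) tuples
    """
    addresses = {}
    image = bytearray()
    for name, data in sections.items():
        addresses[name] = base + len(image)
        image += data
    for section, offset, target, addend in relocations:
        where = addresses[section] - base + offset
        value = (addresses[target] + addend) & 0xFFFFFFFF
        struct.pack_into("<I", image, where, value)
    return addresses, bytes(image)

sections = {
    ".text": bytes(8),              # bytes 4..7 are a pointer slot
    ".data": b"\x2a\x00\x00\x00",   # opaque to the linker
}
relocations = [(".text", 4, ".data", 0)]

addresses, image = link(sections, relocations, base=0x1000)
# Same inputs, different base address: only the patched pointer changes.
_, moved = link(sections, relocations, base=0x400000)
print(hex(addresses[".data"]))
```

Nothing in `link` knows what the bytes mean. It only needs sizes, offsets, and targets, which is why relinking at a different base address works without understanding the code.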

I've written case studies on my blog where I've successfully carved whole pieces out of a program and reused them without reverse-engineering them. I've even made a native port of a ~100 KiB Linux a.out x86 proprietary program from 1995 to Windows, and all I had to do was write 50 lines of C to thunk between the glibc 1.xx-flavored object file and the MinGW runtime.

One user of my tooling managed to delink a 7 MiB executable for a 2009 commercial Windows video game in six weeks (including the time required to fix all the bugs in my alpha-quality COFF exporter and x86 analyzer), leaving out the C standard library. They then relinked it at a different base address and the new executable works so well it's functionally indistinguishable from the original one. They didn't reverse-engineer the thousands of functions or the hundreds of kilobytes of data inside that program to do so.

This is complete heresy according to conventional computer science, which is why its usual assumptions don't apply here. I'd be happy to talk at length about it, but without literature on the topic I'm forced to explain this from basic principles every time and Hacker News isn't the place to write a whole thesis.

The question on my mind: how do you figure out what the functions do without reverse engineering?

If I were to guess, you're saying that you reverse engineer the API boundary without reverse engineering the implementation. But then figuring out what the API contract is without documentation seems intractable for most API boundaries.

For context, my tooling is a Ghidra extension, so there's all the usual Ghidra stuff that applies here.

Indeed, it depends what the API boundary is for the selection to be exported.

If it's the whole program without some well-known libraries (like the C runtime library for a statically linked executable or the Psy-Q SDK for a PlayStation video game), then the API boundary is trivial in the sense that it's a documented one. The hard part is actually figuring out where those libraries are so that you can cut them out of your selection while exporting.

If it's a particular subset internal to the program then it's trickier because you don't have that (but if you know you want to export it, then you must already know something about it). Traditional reverse-engineering techniques do apply here, it's just that you only care about the ABI boundary instead of the whole implementation, so it's usually less work to figure out.

However, if you get it wrong then the toolchain will usually not detect it and you'll have some very exotic undefined behavior on your hands when you try to run your Frankensteinized program. Troubleshooting these issues can be very tricky because my tooling doesn't generate debugging symbols, so the debugging experience is atrocious.

I've always managed to muddle through so far, but one really nasty case did take me a couple of weekends to track down. The lesson: don't cut across a variable when you're exporting, because you'll truncate it. Among other things, that can corrupt whatever was laid out next in memory at runtime when the original delinked code tries to write to the full variable.
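
The truncation failure can be simulated in a few lines. This is only a model, assuming a hypothetical 8-byte `buffer` followed by a 4-byte `lives` counter; the names, sizes, and the `run` helper are invented for illustration and don't come from any real program.

```python
import struct

def layout(symbol_sizes):
    """Place symbols back to back, the way a linker would lay out a section."""
    offsets, cursor = {}, 0
    for name, size in symbol_sizes:
        offsets[name] = cursor
        cursor += size
    return offsets, cursor

def run(symbol_sizes):
    offsets, total = layout(symbol_sizes)
    memory = bytearray(total)
    struct.pack_into("<I", memory, offsets["lives"], 3)
    # Stand-in for the delinked code: it still writes all 8 bytes of
    # `buffer`, no matter how many bytes the export reserved for it.
    start = offsets["buffer"]
    memory[start:start + 8] = b"\xff" * 8
    (lives,) = struct.unpack_from("<I", memory, offsets["lives"])
    return lives

intact = run([("buffer", 8), ("lives", 4)])     # export covers the variable
truncated = run([("buffer", 4), ("lives", 4)])  # export cut across it
print(intact, hex(truncated))
```

In the truncated case the linker places `lives` where the second half of `buffer` used to be, so the write overruns into it. Nothing in the toolchain flags this, which matches the undefined behavior described above.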