Hacker News
by AshamedCaptain 53 days ago
Decompilation to C (and even C++!) has been done automatically for 2-3 decades at least. I am not sure what has changed in recent years other than people playing fast and loose with copyright (and GitHub allowing it, likely because their LLMs also stand to benefit). Introducing LLMs here is only going to introduce errors, delays and likely push you away from a reliable result.

The challenge here is readability. Having read the TP source leak you linked, I think it is even behind the current state of the art, as it is barely above assembly. This is where I suspect even the smallest of LLMs may help, since you don't care that much if it introduces errors.


>Decompilation to C (and even C++!) has been done automatically for 2-3 decades at least.

Only in a very rudimentary sense and definitely not in a working compilation (much less binary equivalent) sense. LLMs have turned this from a gimmick for static analysis into something that actually works pretty well for recompilation projects.

> Only in a very rudimentary sense and definitely not in a working compilation (much less binary equivalent) sense.

Working is the easy part; the hard part is getting something that classifies as readable C. LLMs do not really help reach the "working compilation" part but benefit from it.

We are way past "working compilation" when it comes to LLMs. They are already really good at writing readable, compilable code. The big problem with LLMs is making sure the output binary actually does what you wanted it to do. But if you define the goal not as instructions in vague, unspecific human language but as recreating a given set of binary instructions after compilation, this big drawback goes away. So in a sense they are better suited for recompilation projects than for developing new applications.
My point is that we were past the "working compilation" stage way before LLMs, and I do not think anything in LLMs helps with it; at best, agents use these tools with the same efficiency. I disagree that they're good at writing compilable code, but agree on the readable part.
Which decompiler reliably produced working, high-level C/C++ from assembly? I would have loved to use this thing you are describing here 15 years ago. Compilation is inherently lossy, so any system that could have given you this would have needed pretty heavy LLM-like features anyway.

>I disagree that they're good at writing compilable code

That was never part of the discussion, because as explained several times now it is irrelevant in this case. The existence of the original binary means all you need to do is match up things, which can be automated completely.
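
To make "match up things" a bit more concrete, here is a minimal sketch of the check, assuming you already have the target bytes of one function and some way to compile a candidate source into bytes. The names `first_mismatch`, `find_matching_source` and the `compile_fn` callback are made up for illustration and do not come from any particular tool.

```python
from typing import Callable, Iterable, Optional


def first_mismatch(target: bytes, candidate: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(target, candidate)):
        if a != b:
            return offset
    # One buffer is a strict prefix of the other: they differ where the
    # shorter one ends.
    if len(target) != len(candidate):
        return min(len(target), len(candidate))
    return None


def find_matching_source(
    target: bytes,
    candidates: Iterable[str],
    compile_fn: Callable[[str], bytes],
) -> Optional[str]:
    """Return the first candidate whose compiled bytes equal the target.

    compile_fn is a hypothetical hook: in a real project it would run the
    original compiler with the original flags and extract the function bytes.
    """
    for source in candidates:
        if first_mismatch(target, compile_fn(source)) is None:
            return source
    return None
```

Whoever produces the candidates (a human, a permuter, an LLM) does not matter to the check: a candidate either reproduces the bytes or it does not.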

I do not understand what is so hard about "generating working code". Even the free version of Hexrays was doing it 15 years ago, and I have written one at my company that I have used for over 30 years. It's actually ... trivial?

The problem is readability. No one in their right mind would call what they generate "C++". Mine still interjects assembler from time to time (and not the new version that GCC supports, but the older MSVC style).

LLMs absolutely do not help with the "generate working code" part, because this is an exact problem that neither needs nor benefits from an LLM (other than maybe automating stupid iteration?). They can help with the readability part, because once you already have a working skeleton it doesn't matter that much if they make mistakes, as those are easy to detect.
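
A rough sketch of what "easy to detect" can mean in practice, assuming the working skeleton and the cleaned-up rewrite can both be called on the same inputs. All function names here are hypothetical stand-ins, and agreement on a finite set of inputs is evidence rather than proof of equivalence.

```python
from typing import Any, Callable, Iterable, List, Tuple

Divergence = Tuple[tuple, Any, Any]


def find_divergences(
    reference: Callable[..., Any],
    rewritten: Callable[..., Any],
    inputs: Iterable[tuple],
) -> List[Divergence]:
    """Collect every input on which the rewrite disagrees with the reference."""
    divergences = []
    for args in inputs:
        expected = reference(*args)
        actual = rewritten(*args)
        if expected != actual:
            divergences.append((args, expected, actual))
    return divergences


def raw_sub_401000(a1, a2):
    """Stand-in for ugly but working decompiler output."""
    v3 = a1
    if a2 < v3:
        v3 = a2
    return v3


def minimum_readable(left, right):
    """Stand-in for a correct readability rewrite."""
    return left if left < right else right


def minimum_buggy(left, right):
    """Stand-in for a rewrite that introduced a mistake."""
    return left


SAMPLE_INPUTS = [(1, 2), (5, 3), (4, 4)]
```

With these stand-ins, `find_divergences(raw_sub_401000, minimum_readable, SAMPLE_INPUTS)` comes back empty, while the buggy rewrite is flagged on `(5, 3)`.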