Hacker News new | ask | show | jobs
by lolinder 819 days ago
I think you're misunderstanding OP's objection. It's not simply a matter of going back and forth with the LLM until eventually (infinite monkeys on typewriters style) it gets the same binary as before: Even if you got the exact same source code as the original there's still no automated way to tell that you're done because the bits you get back out of the recompile step will almost certainly not be the same, even if your decompiled source were identical in every way. They might even vary quite substantially depending on a lot of different environmental factors.

Reproducible builds are hard to pull off cooperatively, when you control the pipeline that built the original binary and can work to eliminate all sources of variation. It's simply not going to happen in a decompiler like this.

2 comments

Well, no, but yes.

The critical piece is that this can be done in training. If I collect a large number of C programs from github, compile them (in a deterministic fashion), I can use that as a training, test, and validation set. The output of the ML ought to compile to the same way given the same environment.

Indeed, I can train over multiple deterministic build environments (e.g. different compilers, different compiler flags) to be even more robust.

The second critical piece is that for something like a GAN, it doesn't need to be identical. You have two ML algorithms competing:

- One is trying to identify generated versus ground-truth source code

- One is trying to generate source code

Virtually all ML tasks are trained this way, and it doesn't matter. I have images and descriptions, and all the ML needs to do is generate an indistinguishable description.

So if I give the poster a lot more benefit of the doubt on what they wanted to say, it can make sense.

Oh, I was assuming that Eager was responding to klik99's question about how we could identify hallucinations in the output—round tripping doesn't help with that.

If what they're actually saying is that it's possible to train a model to low loss and then you just have to trust the results, yes, what you say makes sense.

I haven't found many places where I trust the results of an ML algorithm. I've found many places where they work astonishingly well 30-95% of the time, which is to say, save me or others a bunch of time.

It's been years, but I'm thinking back through things I've reverse-engineered before, and having something which kinda works most of the time would be super-useful still as a starting point.

Have you ever trained a GAN?
Technically, yes!

A more reasonable answer, though, is "no."

I've technically gone through random tutorials and trained various toy networks, including a GAN at some point, but I don't think that should really count. I also have a ton of experience with neural networks that's decades out-of-date (HUNDREDS of nodes, doing things like OCR). And I've read a bunch of modern papers and used a bunch of Hugging Face models.

Which is to say, I'm not completely ignorant, but I do not have credible experience training GANs.

That's true but a solvable problem. I once tried to reproduce the build of an uncooperative party and it was mainly tedious and boring.

The space of possible compiler arguments is huge, but ultimately what is actually used is mostly on a small surface.

Apart from that, I wrote a small tool to normalize the version string, timestamps and file path' in the binaries before I compared them. I know there are other sources of non-determinism, but these three things were enough in my case.

The hardest part were the numerous file path' from the build machine. I had not expected that. In hindsight, stripping both binaries before comparison might have helped, but I don't remember why I didn't do that.