by lolinder
819 days ago
I think you're misunderstanding OP's objection. It's not simply a matter of going back and forth with the LLM until eventually (infinite monkeys on typewriters style) it gets the same binary as before: even if you got the exact same source code as the original, there's still no automated way to tell that you're done, because the bits you get back out of the recompile step will almost certainly not be the same. They might even vary quite substantially depending on a lot of different environmental factors. Reproducible builds are hard to pull off even cooperatively, when you control the pipeline that built the original binary and can work to eliminate all sources of variation. It's simply not going to happen in a decompiler like this.
The critical piece is that this can be done in training. If I collect a large number of C programs from GitHub and compile them (in a deterministic fashion), I can split the resulting (binary, source) pairs into training, test, and validation sets. The output of the ML ought to compile the same way given the same environment.
Indeed, I can train over multiple deterministic build environments (e.g. different compilers, different compiler flags) to be even more robust.
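To make that concrete, here is a rough sketch of how such a dataset could be assembled. Everything named here (`Environment`, `compile_c`, `assign_split`, `build_pairs`, the 80/10/10 split) is made up for illustration, and it assumes the compiler is on the PATH and accepts the usual `-c` and `-o` flags. It also does nothing special to guarantee byte-for-byte determinism beyond using fixed relative file names, so treat it as a starting point, not a reproducible-build recipe.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable

# Assumption: an "environment" is just a compiler name plus a tuple of flags.
Environment = tuple[str, tuple[str, ...]]
CompileFn = Callable[[str, Environment], bytes]


def compile_c(source: str, env: Environment) -> bytes:
    """Compile C source to an object file and return its bytes.

    The compiler runs inside the temp directory with relative file names,
    so the random temp path is not passed on the command line.
    """
    compiler, flags = env
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "input.c").write_text(source)
        subprocess.run(
            [compiler, *flags, "-c", "input.c", "-o", "output.o"],
            cwd=tmp,
            check=True,
        )
        return (Path(tmp) / "output.o").read_bytes()


def assign_split(source: str) -> str:
    """Pick a split from a hash of the source, so it is stable across runs."""
    bucket = hashlib.sha256(source.encode("utf-8")).digest()[0] % 10
    if bucket < 8:
        return "train"
    return "validation" if bucket == 8 else "test"


def build_pairs(
    sources: Iterable[str],
    environments: Iterable[Environment],
    compile_fn: CompileFn = compile_c,
) -> list[dict]:
    """Return one (binary, source) record per program per environment."""
    environments = list(environments)
    pairs = []
    for source in sources:
        # All builds of one program share a split, so no program leaks
        # from the test set into training via a different flag set.
        split = assign_split(source)
        for env in environments:
            pairs.append(
                {
                    "split": split,
                    "env": env,
                    "binary": compile_fn(source, env),
                    "source": source,
                }
            )
    return pairs
```

Calling `build_pairs(sources, [("gcc", ("-O0",)), ("clang", ("-O2",))])` would give one record per program per environment. The `compile_fn` parameter is there so the compile step can be swapped for a pinned container or a stub.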
The second critical piece is that for something like a GAN, it doesn't need to be identical. You have two ML algorithms competing:
- One is trying to identify generated versus ground-truth source code
- One is trying to generate source code
Virtually all generative ML tasks are trained against targets that aren't unique, and it doesn't matter. If I have images and descriptions, all the ML needs to do is generate a description that is indistinguishable from a real one.
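For reference, the two-player setup above can be written as the standard conditional GAN objective. The symbols are my own notation, not anything from the thread: b is a compiled binary, s is its ground-truth source, G is the generator (the decompiler), and D is the discriminator scoring how plausible a source is for a given binary.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{(b,s)\sim p_{\text{data}}}\bigl[\log D(b,s)\bigr]
+\mathbb{E}_{b\sim p_{\text{data}}}\bigl[\log\bigl(1 - D(b, G(b))\bigr)\bigr]
```

Note that nothing in this objective asks G(b) to equal s, or to recompile to the same bytes as b. It only asks that D can't tell the pair (b, G(b)) apart from a real (b, s) pair.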
So if I give the poster a lot more benefit of the doubt on what they wanted to say, it can make sense.