HN Mirror

by p1esk 2256 days ago

> I have the source code for the entire Linux kernel as training data, so it's not an issue with quantity

The Linux kernel is only ~30M LOC, which is a really small dataset. For comparison, the Reddit-based dataset for GPT-2 is 100 times larger. Try using all the C code posted on GitHub.

> decompile x64 assembly

You can't "decompile" assembly in the usual sense of the word. Either you decompile machine code into C, or you disassemble machine code into assembly code. The latter is easier than the former, so if you're trying to decompile executables, then perhaps you should train two models: one to convert machine code to assembly, and the other to convert assembly to C. Keep in mind that assembly code produced by an optimizing compiler might differ significantly from assembly code that closely corresponds to the C code.
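A minimal sketch of the two-stage split, assuming each stage is just a function you can swap out. Everything here is hypothetical: `toy_disassemble` is a lookup over four one-byte x86-64 opcodes standing in for the first model, and `toy_decompile` is a placeholder for the second model that only wraps the listing in a C comment.

```python
from typing import Callable, List

# Toy stand-in for stage 1: a few one-byte x86-64 opcodes.
# A real disassembler has to handle variable-length instructions,
# prefixes, and operands; this table is for illustration only.
ONE_BYTE_OPCODES = {
    0x55: "push rbp",
    0x5D: "pop rbp",
    0x90: "nop",
    0xC3: "ret",
}


def toy_disassemble(machine_code: bytes) -> List[str]:
    """Stage 1 (hypothetical): machine code -> assembly lines."""
    return [ONE_BYTE_OPCODES.get(b, f".byte 0x{b:02x}") for b in machine_code]


def toy_decompile(assembly: List[str]) -> str:
    """Stage 2 (placeholder for a trained model): assembly -> C text.

    It does no real decompilation; it only embeds each assembly line
    as a comment inside an empty C function.
    """
    body = "\n".join(f"    /* {line} */" for line in assembly)
    return "void f(void) {\n" + body + "\n}"


def make_pipeline(
    stage1: Callable[[bytes], List[str]],
    stage2: Callable[[List[str]], str],
) -> Callable[[bytes], str]:
    """Chain the two stages so each can be trained or replaced separately."""
    def run(machine_code: bytes) -> str:
        return stage2(stage1(machine_code))
    return run


decompile_executable = make_pipeline(toy_disassemble, toy_decompile)
print(decompile_executable(bytes([0x55, 0x5D, 0xC3])))
```

The point of the split is that the intermediate assembly is something you can inspect and evaluate on its own, so an error can be blamed on one stage rather than on the whole pipeline.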

tsimionescu 2256 days ago

> perhaps you should train two models: one to convert machine code to assembly, and the other to convert assembly to C.

Is the step of going from machine code to gcc-produced assembly not trivial? Does gcc actually produce assembly code that the assembler has to do more with than translate into the corresponding opcodes?

p1esk 2256 days ago

There are two kinds of assembly: 1. assembly that corresponds to optimized machine code, and 2. assembly that closely corresponds to the original C code. As I said, these two assembly versions might look very different depending on optimizations performed by the compiler. You can reduce the difficulty of learning the conversion from machine code to assembly at the expense of increasing the difficulty of learning the conversion from assembly to C code (and vice versa).
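One way to see both kinds of assembly for the same source is to have gcc stop after compilation (`-S`) at two optimization levels and compare the outputs. This is a rough sketch, assuming gcc is on the PATH; the helper names and the output file naming are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def gcc_asm_command(c_file: str, asm_file: str, opt_level: str) -> List[str]:
    """Build a gcc command that emits assembly instead of an object file.

    opt_level is given without the dash, e.g. "O0" or "O2".
    """
    return ["gcc", "-S", f"-{opt_level}", "-o", asm_file, c_file]


def emit_both(c_file: str) -> List[str]:
    """Write <c_file>.O0.s and <c_file>.O2.s; return the paths written.

    Returns an empty list if gcc is not installed.
    """
    if shutil.which("gcc") is None:
        return []
    outputs = []
    for opt in ("O0", "O2"):
        asm_file = f"{c_file}.{opt}.s"
        subprocess.run(gcc_asm_command(c_file, asm_file, opt), check=True)
        outputs.append(asm_file)
    return outputs
```

Calling `emit_both("example.c")` on a small file and diffing the two `.s` outputs gives a feel for how far the `-O2` listing can drift from the structure of the C source compared with the `-O0` one.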
