| I'm planning to apply to PhD programs this fall to work in this area. I know of only a few places in the world currently working on these types of problems: * Martin Vechev, ETH Zurich * Dawn Song, University of California, Berkeley * Eran Yahav, Technion * Miltiadis Allamanis, Microsoft Research Cambridge If anyone knows other advisors looking for graduate students in this area, please let me know. Due to personal circumstances I most likely cannot apply to ETH Zurich or Technion (I don't speak Hebrew anyway), which leaves me with only one potential advisor in a program that I really want. There is also the Python-writing model that OpenAI showed recently at the Microsoft Build conference, so maybe interest is growing at other places as well. I was also recently working on a deep learning decompiler, but I was unable to get my transformer model to learn well enough to actually decompile x64 assembly. I have the source code for the entire Linux kernel as training data, so it's not an issue with quantity. If anyone is interested in helping out with this project, please let me know in a comment. |
The Linux kernel is only ~30M LOC. That's a really small dataset. For comparison, the Reddit-based dataset for GPT-2 is 100 times larger. Try using all the C code posted on GitHub.
decompile x64 assembly
Strictly speaking, you don't "decompile" assembly. You either decompile machine code into source code, or you disassemble machine code into assembly. Disassembly is the easier of the two, so if you're trying to decompile executables, perhaps you should train two models: one to convert machine code to assembly, and the other to convert assembly to C. Keep in mind that assembly produced by an optimizing compiler can differ significantly from assembly that closely corresponds to the C code.
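To make the point about optimization levels concrete, here is a minimal sketch of how (assembly, C) training pairs could be generated for the assembly-to-C stage. It assumes `gcc` is installed and on the PATH; the helper names (`build_compile_command`, `compile_to_assembly`, `make_training_pairs`) are hypothetical and not taken from the original project.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Optimization levels to sample; an assumption, adjust as needed.
OPT_LEVELS = ("0", "1", "2", "3")


def build_compile_command(src, out, opt_level, compiler="gcc"):
    """Return the argv that compiles `src` to assembly text at the given -O level."""
    if opt_level not in OPT_LEVELS:
        raise ValueError(f"unsupported optimization level: {opt_level!r}")
    # -S stops after compilation proper and writes assembly instead of an object file.
    return [compiler, "-S", f"-O{opt_level}", str(src), "-o", str(out)]


def compile_to_assembly(c_source, opt_level="0", compiler="gcc"):
    """Compile a C source string and return the generated assembly text."""
    if shutil.which(compiler) is None:
        raise RuntimeError(f"{compiler} not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.c"
        out = Path(tmp) / "unit.s"
        src.write_text(c_source)
        subprocess.run(
            build_compile_command(src, out, opt_level, compiler),
            check=True,
            capture_output=True,
        )
        return out.read_text()


def make_training_pairs(c_source, opt_levels=OPT_LEVELS):
    """Return (assembly, c_source) pairs, one per optimization level."""
    return [(compile_to_assembly(c_source, lvl), c_source) for lvl in opt_levels]
```

Calling `make_training_pairs` on a small function and comparing the `-O0` and `-O2` outputs shows how far optimized assembly can drift from the source. Including several levels in the training set is one possible way to expose the model to that variation.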