by Jugglerofworlds 2209 days ago
I'm planning on applying for PhD programs this fall to work in this area. There are only a few places in the world right now that I know of working on these types of problems. They are:

* Martin Vechev, ETH Zurich

* Dawn Song, University of California Berkeley

* Eran Yahav, Technion

* Miltiadis Allamanis, Microsoft Research Cambridge

If anyone knows of other advisors looking for graduate students in this area, please let me know. Due to personal circumstances, I most likely cannot apply to ETH Zurich or Technion (I don't speak Hebrew anyway), which leaves me with only one potential advisor in a program that I really want.

There is also the Python-writing model that OpenAI showed recently at the Microsoft Build conference, so maybe interest is growing at other places as well.

I was also recently working on a deep learning decompiler but was unable to get my transformer model to learn well enough to actually decompile x64 assembly. I have the source code for the entire Linux kernel as training data, so it's not an issue with quantity. If anyone is interested in helping out with this project, please let me know in a comment.

9 comments

> I have the source code for the entire Linux kernel as training data, so it's not an issue with quantity

The Linux kernel is only ~30M LOC. That's a really small dataset. For comparison, the Reddit-based dataset for GPT-2 is 100 times larger. Try using all the C code posted on GitHub.

> decompile x64 assembly

You can't "decompile" assembly. Either you decompile machine code into a high-level language, or you disassemble machine code into assembly. The latter is easier than the former, so if you're trying to decompile executables, then perhaps you should train two models: one to convert machine code to assembly, and the other to convert assembly to C. Keep in mind that assembly code produced by an optimizing compiler might differ significantly from assembly code which closely corresponds to C code.
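To illustrate the first stage of that two-model idea: decoding machine code into assembly is largely a mechanical table lookup, which is why it is the easier half. Below is a minimal sketch, assuming a made-up toy table that covers only a handful of single-byte x86-64 opcodes. The names `ONE_BYTE_OPCODES` and `toy_disassemble` are hypothetical, and a real disassembler has to handle variable-length instructions, prefixes, and operands, which this does not attempt.

```python
# Toy lookup table: a few single-byte x86-64 opcodes and their mnemonics.
# This is an illustrative assumption, not a complete or real disassembler.
ONE_BYTE_OPCODES = {
    0x55: "push rbp",
    0x5D: "pop rbp",
    0x90: "nop",
    0xC3: "ret",
    0xC9: "leave",
    0xCC: "int3",
}


def toy_disassemble(code: bytes) -> list:
    """Map each byte to a mnemonic, treating every byte as one instruction.

    Bytes missing from the table are emitted as raw `.byte` directives.
    Real x86-64 instructions are variable-length, so this byte-by-byte
    walk is only valid for the single-byte opcodes listed above.
    """
    lines = []
    for b in code:
        lines.append(ONE_BYTE_OPCODES.get(b, ".byte 0x{:02x}".format(b)))
    return lines


if __name__ == "__main__":
    # A trivial function body: push rbp; nop; pop rbp; ret
    for line in toy_disassemble(bytes([0x55, 0x90, 0x5D, 0xC3])):
        print(line)
```

Going from that assembly back to C has no such table, since many different C programs compile to the same instructions. That is the part a learned model would have to handle.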

> perhaps you should train two models: one to convert machine code to assembly, and the other to convert assembly to C.

Is the step of going from machine code to gcc-produced assembly not trivial? Is gcc actually producing assembly code that an assembler needs to do more with than convert to the corresponding opcodes?

There are two kinds of assembly: 1. assembly that corresponds to optimized machine code, and 2. assembly that closely corresponds to the original C code. As I said, these two assembly versions might look very different depending on optimizations performed by the compiler. You can reduce the difficulty of learning the conversion from machine code to assembly at the expense of increasing the difficulty of learning the conversion from assembly to C code (and vice versa).

As part of the short list:

* Swarat: https://blog.sigplan.org/2020/04/15/synthesizing-neurosymbol... <- probably heaviest on the math side wrt PL people

* Ras Bodik (Berkeley -> UW): esp. w/ Pedro Domingos and all the MSR collaborators (Sumit Gulwani, ...) <- a bit biased b/c I was in the group while at Berkeley; Ras + Dawn are crazy creative

* Percy Liang (Stanford): Coming from the ML side and w/ a long-running interest here

UCL is another good choice for this area. I did some research in this area back in undergrad. There was a workshop called NAMPI (Neural Abstract Machines & Program Induction) on this subfield, so I'd recommend going through its accepted papers and seeing where all the faculty come from.

Also, general advice: after finding a few papers you like, go through the papers they cite that are relevant to the subfield, and also the papers that cite them. That's one of the best ways of finding other related research that interests you.

I've listened to some talks by Song, and my impression is that she does not seem to have a strong technical grasp of what's going on, for what it's worth. There is a lack of clarity in her communication style. I'd try to work with Vechev, Liang, UW PLSE group if I were you...

If the language barrier is the only reason keeping you from applying to the Technion, or any other university in Israel, graduate courses in Israel will generally be given in English if a non-Hebrew speaker is present. It's standard practice; just email the instructor in advance. Besides, Haifa is a very nice city.

You should also check out professors at Ben Gurion University, Tel Aviv University and the Hebrew University who might have similar interests, IIRC. Feel free to hit me up if there's some page in Hebrew that doesn't translate well.

Another person to talk to would be Brooks Paige, who recently joined UCL: https://tbrx.github.io/ This is totally not my area of research (I'm a PL person), but I know Brooks from the Alan Turing Institute and I think he'd be a great advisor for this kind of work.

Premkumar Devanbu at UC Davis works in this area. I've taken his graduate course on the subject.

Maybe don't discount the Technion so easily. All of the professors are fluent in English as is most of the student body. There are many graduate students who start studying there without knowing Hebrew.

There are a lot of groups working in this area. I'm going to pitch ours first, and then also point you to some others!

My colleagues and I run the SEFCOM lab at Arizona State University (https://sefcom.asu.edu/). Most relevant to your interests, Fish Wang (fellow SEFCOM professor) and I (Yan Shoshitaishvili) founded the angr program analysis framework (https://angr.io/) back in our grad school days and have continued to evolve it together with our students in the course of our research at ASU. We're currently undertaking a concerted push into decompilation research, using both PL and ML techniques. This research direction is a passion of ours, and there's plenty of room for you here if you're interested!

Of course, we also do quite a bit of work in other areas of program analysis (including less overtly "mathy" techniques, like fuzzing) as well as other areas of cybersecurity. We are also quite active in the Capture the Flag community, if that is something that interests you!

Other places that do research in program analysis off the top of my head:

- Chris Kruegel (https://sites.cs.ucsb.edu/~chris/) and Giovanni Vigna (https://sites.cs.ucsb.edu/~vigna/) at UCSB (disclaimer: I got my PhD from them!)

- Davide Balzarotti (http://s3.eurecom.fr/~balzarot/) and Yanick Fratantonio (https://reyammer.io/) at EURECOM

- Antonio Bianchi (https://antoniobianchi.me/) (and, soon, Aravind Machiry) at Purdue

- Alexandros Kapravelos (https://kapravelos.com/) at NCSU

- Taesoo Kim (https://taesoo.kim/) at Georgia Tech

- Yeongjin Jang (https://www.unexploitable.systems/) at O(regon)SU

- Zhiqiang Lin (http://web.cse.ohio-state.edu/~lin.3021/) at O(hio)SU

- Brendan Dolan-Gavitt (https://engineering.nyu.edu/faculty/brendan-dolan-gavitt) at NYU

- Wil Robertson (https://wkr.io/) and Engin Kirda (https://www.ccs.neu.edu/home/ek/) at Northeastern

If you have questions about the PhD process or this research area, feel free to reach out: yans@asu.edu or @Zardus on Twitter!