Hacker News
by a2code 819 days ago
The problem is interesting in at least two respects. First, an ideal decompiler would effectively eliminate proprietary source code. Second, the abundance of publicly available C code makes it easy to build a dataset of paired ASM and source code. Optimization level, compiler choice, and platform also add a lot of variety.
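As a rough illustration of the paired-dataset idea, here is a minimal sketch, not the authors' pipeline. It assumes `gcc` and/or `clang` are on the PATH; the function names `build_commands` and `make_pairs` and the output file naming are made up for this example. `-S` and `-O0`..`-O3` are the standard flags for emitting assembly at a given optimization level.

```python
import itertools
import shutil
import subprocess
from pathlib import Path


def build_commands(source, out_dir,
                   compilers=("gcc", "clang"),
                   opt_levels=("O0", "O1", "O2", "O3")):
    """Return (asm_path, argv) jobs, one per compiler/optimization combination."""
    source = Path(source)
    out_dir = Path(out_dir)
    jobs = []
    for cc, opt in itertools.product(compilers, opt_levels):
        # Hypothetical naming scheme: <stem>.<compiler>.<opt>.s
        asm_path = out_dir / f"{source.stem}.{cc}.{opt}.s"
        argv = [cc, "-S", f"-{opt}", str(source), "-o", str(asm_path)]
        jobs.append((asm_path, argv))
    return jobs


def make_pairs(source, out_dir):
    """Compile one C file and return (assembly_text, c_source_text) pairs.

    Compilers that are not installed, and compilations that fail, are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    c_text = Path(source).read_text()
    pairs = []
    for asm_path, argv in build_commands(source, out_dir):
        if shutil.which(argv[0]) is None:
            continue  # this compiler is not available on this machine
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            pairs.append((asm_path.read_text(), c_text))
    return pairs
```

Calling `make_pairs("foo.c", "out")` on a machine with both compilers would yield up to eight pairs for a single source file, which is where the variety comes from. Varying the platform would additionally need cross-compilers, which this sketch leaves out.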

What is unclear to me is: why did the authors fine-tune the DeepSeek-Coder model? Could you train an LLM from scratch on a similar dataset? How big does the LLM need to be? Can it run locally?

4 comments

Most proprietary code runs behind firewalls and won't be affected by this one way or another.

It's basically always better to start training with a pre-trained model rather than random, even if what you want isn't that close to what you start with.

Ideal decompilers do not exist. In some sense they can never exist, as compilers are lossy, but even taking a liberal view of “high level understanding of the resulting code” this is essentially the AGI for computer security. Nobody has come close to it!

Thanks! Training a language model from scratch is data-intensive: Llama2 was trained on 2 trillion tokens, while our dataset is around 4 billion.

The appropriate size of the model is not straightforward to determine. In our experiments, a 7 billion parameter model achieved 21% executability compared to just 10% for a 1 billion parameter model. However, their re-compilability rates are quite similar.

Running a 1 billion parameter model requires at least 2GB of GPU memory, which most GPUs can provide. A 7 billion parameter model needs 14GB, which suits GPUs like the 3090/4090 series. For a 33 billion parameter model, an A100 (80GB) would be the single-card option. Technically a MacBook could work, but you wouldn't really want to use it.
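The figures above are consistent with about 2 bytes per parameter, i.e. 16-bit weights. That precision is an inference from the numbers, not something stated in the thread. A quick back-of-the-envelope sketch under that assumption (the helper name `weight_memory_gb` is made up, and activations and KV cache are ignored):

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an assumption, and
    activations and KV cache are not counted.
    """
    return num_params * bytes_per_param / 1e9


for name, n in [("1B", 1e9), ("7B", 7e9), ("33B", 33e9)]:
    print(f"{name}: ~{weight_memory_gb(n):.0f} GB")
```

Under the same assumption the 33B model comes to roughly 66GB of weights, which is why a single 80GB card is the natural fit.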

I assume it's related to the cost of training vs. fine-tuning. It could also be a starting point to validate an idea.