Hacker News
by londons_explore 815 days ago
If an LLM is used, it's unclear how to best do it.

One could try to train one's own LLM from scratch, using an encoder-decoder (translation, aka seq2seq) architecture that tries to predict the correct variable name given the decompiled output.

One could try to use something like GPT-4 with a carefully designed prompt: "Given this data structure, what might be the name for this field?"
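
To make the prompt idea concrete, here is a rough sketch of how such a prompt might be assembled. The function name, the example struct, and the placeholder field names (`f_0`, `f_1`, `f_2`) are all made up for illustration. It only builds the prompt string; sending it to a model is left out.

```python
def build_field_name_prompt(struct_source: str, field: str) -> str:
    """Build a prompt asking a model to suggest a name for one struct field.

    Hypothetical helper: the wording follows the prompt quoted above,
    everything else is an assumption.
    """
    return (
        "Given this data structure, what might be the name for "
        f"the field `{field}`?\n\n"
        f"{struct_source.strip()}\n\n"
        "Answer with a single identifier."
    )


# Made-up decompiler output with placeholder names.
example_struct = """
struct s_0 {
    char *f_0;
    int f_1;
    struct s_0 *f_2;
};
"""

prompt = build_field_name_prompt(example_struct, "f_2")
print(prompt)
```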

One could try to use something pretrained like Llama, but then fine-tune it on hundreds of thousands of compiled and decompiled programs.
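
For the fine-tuning route, the training data would presumably be pairs of decompiled output and original source. A minimal sketch of turning such pairs into JSONL records follows. The `prompt`/`completion` field names and the sample pair are assumptions, not the format of any particular fine-tuning tool.

```python
import json
from typing import List, Tuple


def to_training_record(decompiled: str, original: str) -> str:
    """Serialize one (decompiled, original) pair as a JSON line.

    The "prompt"/"completion" keys are an assumed layout.
    """
    record = {"prompt": decompiled.strip(), "completion": original.strip()}
    return json.dumps(record)


# Made-up example pair: decompiler output and the source it came from.
pairs: List[Tuple[str, str]] = [
    (
        "int f_0(int a_0) { return a_0 * 2; }",
        "int double_it(int x) { return x * 2; }",
    ),
]

jsonl = "\n".join(to_training_record(d, o) for d, o in pairs)
print(jsonl)
```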

1 comment

Option 4:

One could take a pretrained model like Llama and train it on only a few thousand compiled and decompiled programs. Then feed it compiled programs, have it decompile them, and evaluate that output to make a new dataset to fine-tune on again. Repeat until satisfactory.
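
Roughly, that loop could look like the sketch below. Everything here is hypothetical: `fine_tune` and `score` are stand-ins the caller supplies, and the threshold and round limit are arbitrary. How the output gets evaluated is the hard part and is left open.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]          # (compiled program, decompiled source)
Model = Callable[[str], str]    # maps a compiled program to source text


def self_training_loop(
    seed_pairs: List[Pair],
    compiled_programs: List[str],
    fine_tune: Callable[[List[Pair]], Model],
    score: Callable[[str, str], float],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> Model:
    """Fine-tune, decompile, keep outputs that score well, and repeat."""
    dataset = list(seed_pairs)
    model = fine_tune(dataset)
    for _ in range(max_rounds):
        # Decompile every program and keep only outputs that pass evaluation.
        accepted = []
        for program in compiled_programs:
            candidate = model(program)
            if score(program, candidate) >= threshold:
                accepted.append((program, candidate))
        new_pairs = [p for p in accepted if p not in dataset]
        if not new_pairs:
            break  # nothing new was learned this round
        dataset.extend(new_pairs)
        model = fine_tune(dataset)
    return model


# Toy stand-ins so the loop can be run end to end.
def toy_fine_tune(pairs: List[Pair]) -> Model:
    table = dict(pairs)
    return lambda program: table.get(program, "src_" + program)


def toy_score(program: str, source: str) -> float:
    return 1.0 if source == "src_" + program else 0.0


model = self_training_loop([("a", "src_a")], ["b", "c"], toy_fine_tune, toy_score)
```

In practice the scoring function decides whether this works at all, since the model is otherwise just retrained on its own mistakes.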