by Retr0id 66 days ago
Tokens can also be burnt on decompilation.

Yes, and it apparently burns lots of tokens. But what I've heard is that the outcomes are drastically less expensive than hand-reversing was, when you account for labor costs.
Can confirm. Matching decompilation in particular (where you pin down the original compiler, write your guess at the source, compile it, then compare the assembly, repeating until it matches) is very token-intensive, but it's now very viable: https://news.ycombinator.com/item?id=46080498
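
For anyone who hasn't seen that loop, here is a rough sketch of just the "compare assembly" step, assuming you already have the target and candidate disassembly as text. The `normalize` rules and the `asm_diff` name are made up for illustration; real matching-decomp projects use far more careful differs, and the compile step is left out entirely.

```python
import difflib
import re


def normalize(asm: str) -> list[str]:
    """Reduce a disassembly listing to bare instructions for comparison."""
    lines = []
    for raw in asm.splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments (assumes '#' comment style)
        # Drop a leading "address:" prefix. Assumption: hex-looking labels
        # such as "face:" would be stripped too, which is fine for a sketch.
        line = re.sub(r"^\s*[0-9a-fA-F]+:\s*", "", line)
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:
            lines.append(line)
    return lines


def asm_diff(target: str, candidate: str) -> list[str]:
    """Return unified-diff lines; an empty list means the candidate matches."""
    return list(
        difflib.unified_diff(
            normalize(target),
            normalize(candidate),
            "target",
            "candidate",
            lineterm="",
        )
    )
```

The idea is that an empty diff ends the loop, and a non-empty one is what gets fed back to the model along with the current source guess.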

Of course LLMs see a lot more source-assembly pairs than even skilled reverse engineers, so this makes sense. Any area where you can get unlimited training data is one where we expect to see top-tier performance from LLMs.

(also, hi Thomas!)

My own experience has been that "ghidra -> ask LLM to reason about ghidra decompilation" is very effective on all but the most highly obfuscated binaries.

Burning tokens by asking the LLM to compile, disassemble, compare assembly, recompile, repeat seems very wasteful and inefficient to me.

LaurieWired did a good episode about that kind of thing https://www.youtube.com/watch?v=u2vQapLAW88
That matches my experience too - LLMs are very capable at "translating" between domains - one of the best experiences I've had with LLMs is turning "decompiled" source into "human readable" source. I don't think that "binary-only" closed source is the defense against this that some people here seem to think it is.
Has anyone used an LLM to deobfuscate compiled Javascript?
> Has anyone used an LLM to deobfuscate compiled Javascript?

Seems like a waste of money; wouldn't it be better to extract the AST deterministically, write it out and only then ask an LLM to replace those auto-generated symbol names with meaningful ones?
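
A minimal sketch of that split, with one big caveat: it uses Python's own `ast` module on Python source as a stand-in, since a JavaScript parser isn't in the standard library. The deterministic part collects identifiers and applies renames; the only thing a model would supply is the `mapping` dict. `collect_names`, `apply_names` and `Renamer` are hypothetical names, and the rename is not scope-aware, so the same name is renamed everywhere (builtins included if they appear in the map).

```python
import ast


class Renamer(ast.NodeTransformer):
    """Apply a name map to identifiers; the tree structure is left untouched."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)  # also visit any annotation
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # keep walking into arguments and body
        return node


def collect_names(source: str) -> list[str]:
    """Deterministic step: list the identifiers a model would be asked to rename."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, ast.FunctionDef):
            names.add(node.name)
    return sorted(names)


def apply_names(source: str, mapping: dict[str, str]) -> str:
    """Deterministic step: rewrite the source using the suggested names."""
    tree = Renamer(mapping).visit(ast.parse(source))
    return ast.unparse(tree)
```

With this shape the model only ever sees a short list of names plus whatever context you choose to send, and never gets to touch the code structure itself.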

yes, but it requires some nudging if you don't want to waste tokens. it will happily grep and sed through massive javascript bundles but if you tell it to first create tooling like babel scripts to format, it will be much quicker.
> but if you tell it to first create tooling like babel scripts to format, it will be much quicker.

Can you expand on this? Is there existing tooling for deminification?

for me it was custom scripts looking for data in minified bundles and refactoring for easier protocol reverse engineering, e.g. https://github.com/echtzeit-solutions/monsgeek-akko-linux/bl...
I've used it for hobby efforts on Electron/React Native (Hermes bytecode) apps and it seems to work reasonably well
Yep. They are good at it.
Yeah, it's token intensive but worth it. I built a very dumb example harness which used IDA via MCP and analyzed/renamed/commented all ~67k functions in a binary, using Claude Haiku for about $150. A local model could've accomplished it for much less/free. The knowledge base it outputs and the marked up IDA db are super valuable.
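
To make the shape of such a harness concrete, here is a hedged sketch. It is not the harness described above and makes no IDA or MCP calls: `FunctionInfo`, `KnowledgeBase`, `annotate` and the `suggest` callback are all hypothetical, and the `sub_` prefix check assumes the disassembler's default naming for unnamed functions. The last helper is just the arithmetic on the numbers quoted: about $150 over roughly 67k functions works out to around 0.2 cents per function.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class FunctionInfo:
    address: int
    name: str        # current name, often auto-generated (e.g. "sub_401000")
    pseudocode: str  # decompiler output for this function


@dataclass
class KnowledgeBase:
    renames: dict[int, str] = field(default_factory=dict)
    notes: dict[int, str] = field(default_factory=dict)


# Hypothetical callback: takes a function, returns (suggested name, comment).
# In a real harness this is where the model call would live.
Suggest = Callable[[FunctionInfo], tuple[str, str]]


def annotate(
    functions: Iterable[FunctionInfo],
    suggest: Suggest,
    only_auto_named: bool = True,
) -> KnowledgeBase:
    """Ask `suggest` about each function and record the answers by address."""
    kb = KnowledgeBase()
    for fn in functions:
        # Skip functions that already carry a meaningful name (assumed heuristic).
        if only_auto_named and not fn.name.startswith("sub_"):
            continue
        new_name, note = suggest(fn)
        kb.renames[fn.address] = new_name
        kb.notes[fn.address] = note
    return kb


def cost_per_function(total_cost: float, function_count: int) -> float:
    """Average spend per analyzed function."""
    return total_cost / function_count
```

Keeping the model behind a plain callback means the same loop works with a hosted model or a local one, and the resulting `KnowledgeBase` can be written back to the disassembler database in a separate step.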
Do you have the repo example?
I did something similar using ghidramcp to dig around this keyboard firmware. The repo contains the Ghidra project, a Linux driver, and even patches to the original stock firmware: https://github.com/echtzeit-solutions/monsgeek-akko-linux
Another asymmetric advantage for defenders - attackers need to burn tokens to form incomplete, outdated, and partially wrong pictures of the codebase while the defender gets the whole latest version plus git history plus documentation plus organizational memory plus original authors' cooperation for free.
>original authors' cooperation

Ha

>for free.

Haha, it is more complicated in reality

> Tokens can also be burnt on decompilation.

Prediction 1. We're going to have cheap "write Photoshop and AutoCAD in Rust as a new program / FOSS" soon. No desktop software will be safe. Everything will be cloned.

Prediction 2. We'll have a million Linux and Chrome and other FOSS variants with completely new codebases.

Prediction 3. People will trivially clone games, change their assets. Modding will have a renaissance like never before.

Prediction 4. To push back, everything will move to thin clients.

I think if prediction 1 is true (that it becomes cheap to clone existing software in a way that doesn't violate copyright law), the response will not be purely technical (moving to thin clients, or otherwise trying to technically restrict the access surface to make reverse engineering harder). Instead I'd predict that companies look to the law to replace the protections that they previously got from copyright.

Obvious possibilities include:

* More use of software patents, since these apply to underlying ideas, rather than specific implementations.

* Stronger DMCA-like laws which prohibit breaking technical provisions designed to prevent reverse engineering.

Similarly, if the people predicting that humans are going to be required to take ultimate responsibility for the behaviour of software are correct, then it clearly won't be possible for that to be any random human. Instead you'll need legally recognised credentials to be allowed to ship software, similar to the way that doctors or engineers work today.

Of course these specific predictions might be wrong. I think it's fair to say that nobody really knows what will have changed in a year, or where the technical capabilities will end up. But I see a lot of discussions and opinions that assume zero feedback from the broader social context in which the tech exists, and those seem likely to be missing a big part of the picture.