by mnarayan01 3376 days ago
It seems like this is simply a disassembler error (albeit an understandable one). Am I missing something?

Edit: Based on the responses below, I guess the point is that the disassembler can't generate Java code that will "naively" (wrong word, but I can't think of a better one) generate the same output. Notable (I assume) in that name munging would be problematic outside the current compilation unit.


No, Java class files usually include the names of variables and functions, so this isn't the disassembler's fault. The class file actually had two functions with the same name. You could certainly implement an anti-obfuscation layer to detect stuff like this, but I wouldn't call it an "error" as is.
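To illustrate why two same-named methods can coexist in a class file, here is a minimal sketch. It assumes the standard JVM rule that a method is identified by its name plus its descriptor, and that the descriptor includes the return type. The class name `DescriptorDemo` and the helper `descriptorOf` are made up for illustration, and the method name `a` is borrowed from the examples in this thread.

```java
import java.lang.invoke.MethodType;

class DescriptorDemo {
    // Builds the JVM descriptor string for a no-argument method
    // with the given return type.
    static String descriptorOf(Class<?> returnType) {
        return MethodType.methodType(returnType).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // Same name, same (empty) parameter list, different descriptors.
        System.out.println("a" + descriptorOf(void.class));   // a()V
        System.out.println("a" + descriptorOf(String.class)); // a()Ljava/lang/String;
    }
}
```

Because the two descriptors differ, the runtime sees two distinct methods, even though Java source has no way to declare both in one class.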
I think it's a common expectation that a disassembler should provide output that is valid to be compiled, and that therefore this is an error.
Sure, but it's also expected that it should provide output that could be compiled to produce the input, and in this case it's impossible to satisfy both those constraints. The best thing would probably be to leave a comment in the generated source code explaining the problem, and provide an option to rename overlapping functions.
It's not necessarily always possible to output valid-to-compile Java sources. If the bytecode came from a different JVM language, it may contain patterns that javac would never emit.
I would argue that decompilers are primarily reference tools (i.e., a more readable disassembly). It is wrong to see them as source code recovery tools because they will never be able to capture every aspect of the original program. So it doesn't make sense to have as a primary goal the ability to provide output that can be compiled again. It is more important that they more faithfully represent the disassembly.
Well, it's not a JVM class file but a Dalvik dex file. A disassembler which can't generate compilable code from this valid bytecode is incomplete, or in other words, buggy.
Not if you are talking about compilable Java code, because, as the article explains, there are things you can express in bytecode that you simply cannot express in Java source code.
Indeed. The lower-level language must be more expressive, almost by definition. It is more difficult to write but allows fine-grained control. This is the reason some optimization and obfuscation tricks are done in assembler (in the native world, not Java). And hence a disassembler simply cannot translate it back.
There are two valid ways for a disassembler to mitigate this: a) decompile to a language in which the bytecode can be expressed (in a concise/expressive manner; Java would always be a "possible" target because of Turing completeness), or b) accommodate the fact that there could be signature collisions in Java, e.g. by prefixing/suffixing the method name.
If you change the method name you end up with code that acts differently, just imagine something that does something like this pseudocode:

if (!new Exception().getStackTrace().getSha1sum().startsWith("0000")) alert("hello decompiler")

Your comment about java and turing completeness doesn't make sense unless you want the decompiler to basically output a java implementation of a JVM?
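A runnable variant of the pseudocode above, as a sketch only: instead of hashing the whole stack trace, it just compares the calling method's name. The class `StackTraceCheck`, the helper `callerName`, and the method names `a` and `a_string` are hypothetical, chosen to match the renaming discussed in this thread.

```java
class StackTraceCheck {
    // Returns the name of the method that called this helper,
    // as recorded in the stack trace.
    static String callerName() {
        // Element 0 is callerName itself; element 1 is its direct caller.
        return new Exception().getStackTrace()[1].getMethodName();
    }

    // Hypothetical obfuscated method: it checks its own name at runtime.
    static String a() {
        return callerName().equals("a") ? "original" : "renamed";
    }

    // The same body under a name a decompiler might choose to avoid a collision.
    static String a_string() {
        return callerName().equals("a") ? "original" : "renamed";
    }

    public static void main(String[] args) {
        System.out.println(a());
        System.out.println(a_string());
    }
}
```

Here `a()` takes the "original" branch and `a_string()` takes the "renamed" branch, so recompiled output with changed names would compile but behave differently.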

Dare [0] emulates the Dalvik VM's runtime behavior to generate verifiable (for the vast majority of cases) Java bytecode from dex bytecode.

[0]: http://siis.cse.psu.edu/dare/

What did the disassembler do wrong? It just happens that there is no valid Java code which could produce that (valid) bytecode. What should it have outputted instead?
Output the method names as a_void and a_string instead of just a?
Then it wouldn't be true to the real structure of the program (although it would be true to the real behaviour of the program). Why is that less wrong?
> Why is that less wrong?

It compiles.

But may not run.
It's obviously not impossible to create legit Java code during disassembly. This is like an arms race; go ahead and disassemble my code, but I've made sure that you now need a better disassembler to produce valid code.

Eventually that disassembler will come along, and then more tricks will be used by the obfuscators...

This is not a proper arms race; it terminates with a victory for the disassemblers in a finite and feasible amount of work.

There is an arms race over whether the resulting output is at all human-comprehensible. The only thing preventing that from terminating with victory for the obfuscators is that they have a finite technical budget with which to obfuscate. Changing identifiers is nearly free, but as they go beyond that and start rewriting the code itself, they start incurring performance penalties, and the odds that the rewrite will break the program increase as they get more aggressive.

It would fail at runtime if this method is called by reflection.
Yes. That reflection code would also have failed the moment the obfuscator did its thing, so that is not really a concern. The two are not really compatible.
In basic blocks (no conditionals or loops), a disassembler mostly does a mechanical job of translating opcodes to the appropriate Java code and aggregating expressions. But it doesn't rename functions or classes. They are left as they were in the bytecode.
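A small sketch of the reflection failure being discussed, assuming a method originally called `a` that some tool (obfuscator or decompiler) renamed to `a_string`. The class `ReflectionRename` and the helper `hasMethod` are hypothetical names for illustration.

```java
import java.lang.reflect.Method;

class ReflectionRename {
    // Hypothetical method that used to be called "a" before a rename.
    static String a_string() {
        return "hello";
    }

    // Reports whether this class declares a no-argument method with the given name.
    static boolean hasMethod(String name) {
        try {
            Method m = ReflectionRename.class.getDeclaredMethod(name);
            return m != null;
        } catch (NoSuchMethodException e) {
            // A lookup by the old name fails only at runtime, not at compile time.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasMethod("a"));
        System.out.println(hasMethod("a_string"));
    }
}
```

The lookup by string name compiles either way, which is why any rename, whoever performs it, breaks reflective callers silently until they run.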
No, it's a mismatch between what the compiler accepts and what the JVM can execute.

The interesting part to me is that it seems that Java would be perfectly capable of differentiating between methods by return type if the compiler was tweaked slightly. Is there a reason why this isn't a formal language feature?
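One place where this mismatch shows up in ordinary Java is covariant return types: javac itself emits a synthetic bridge method, so the compiled class ends up with two methods that share a name and parameter list and differ only in return type. A minimal sketch, with hypothetical class names `BridgeDemo`, `Base`, and `Derived`:

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

class BridgeDemo {
    static class Base {
        Object get() { return "base"; }
    }

    static class Derived extends Base {
        @Override
        String get() { return "derived"; } // covariant return type
    }

    // Describes every method named "get" declared directly in Derived.
    static List<String> getReturnTypes() {
        List<String> result = new ArrayList<>();
        for (Method m : Derived.class.getDeclaredMethods()) {
            if (m.getName().equals("get")) {
                String tag = m.isBridge() ? " (bridge)" : "";
                result.add(m.getReturnType().getSimpleName() + tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Order is unspecified; expect one String entry and one Object bridge entry.
        System.out.println(getReturnTypes());
    }
}
```

So the runtime already handles return-type-only differences; the restriction lives in the source language and its overload resolution rules.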

As explained in the post, a function call is a complete expression for which the appropriate function must be found. If there are two functions with the same name and parameter types, it would be impossible to compile such an expression on its own.
Not necessarily, you'd just need a syntactic mechanism to disambiguate.
And we already have one, since Java either throws away the return value, or assigns it to a variable the type of which is known.

Edit: here's the edge case though. You can call a function and use the return value directly as a parameter: foo(bar()). It's possible to have two foo that take both possible bar return types, at which point the compiler is stuck.

It could require a cast in this instance, however. The more I think about this the more I wonder why this isn't possible.
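Java already uses a cast to resolve a similar ambiguity on the argument side, which gives a rough idea of what "require a cast" could look like. This is only an analogy, since return-type overloading itself cannot be written in Java source; the class `CastDisambiguation` and its `foo` overloads are hypothetical.

```java
class CastDisambiguation {
    static String foo(String s) { return "foo(String)"; }

    static String foo(Integer i) { return "foo(Integer)"; }

    // A bare foo(null) is rejected by javac as ambiguous, because null
    // matches both overloads and neither parameter type is more specific.
    // A cast tells the compiler which overload is meant.
    static String callWithString() {
        return foo((String) null);
    }

    static String callWithInteger() {
        return foo((Integer) null);
    }

    public static void main(String[] args) {
        System.out.println(callWithString());
        System.out.println(callWithInteger());
    }
}
```

A hypothetical return-type-overloaded `bar()` inside `foo(bar())` would presumably need the same kind of annotation at the call site.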

> here's the edge case though.

That's no more an edge case than the cases where you throw away the return value or you're binding to an ambiguous type e.g. `A getFoo()`, `B getFoo()`, `Object foo = getFoo()`.

> The more I think about this the more I wonder why this isn't possible.

The Java spec does not say; for C++, Stroustrup states it's

> to keep resolution for an individual operator or function call context-independent.

The Java reason is likely also some sort of Principle of Least Surprise claim.