by isido 3874 days ago
This site and accompanying book (http://beginners.re/Reverse_Engineering_for_Beginners-en.pdf) seem a nice effort. Briefly reading the book, it seems that you need to be an intermediate or advanced beginner, or know something about assembly beforehand.

Some terms (opcode, ISA) are not really explained (except in the glossary) before they are used, and there is perhaps too much detail at the expense of the bigger picture.

Criticisms aside, the book and the challenges seem interesting and the efforts of the author must be appreciated!


Get past the idea that you need to be advanced to grok assembly language. Assembly is in a lot of ways easier than a lot of higher-level languages. When I was growing up in the early 90s, a lot of my friends started on x86 assembly as a first language --- and x86 is the most annoying instruction set to learn.

The right way to learn this stuff is to dive in. You'll be over your head for a few hours, but you'll get your bearings. There are topics this technique doesn't work great with, but assembly reversing isn't one of them.

An additional benefit: assembly is one of those things that you might not use all the time in your career (although I've ended up using it quite a bit), but that will nonetheless illuminate lots of other things about computer science. There's a reason Knuth used it as a language to express algorithms in TAOCP.

I can sum up the core idea of assembly for you in just a few sentences:

* You're given 8-32 global variables of fixed size to work with, called "registers".

* Virtually all computation is expressed in terms of simple operations on registers.

* Real programs need many more than 8-32 variables to work with.

* What doesn't fit in registers lives in memory.

* Memory is accessed either with loads and stores at addresses, as if it were a big array, or through PUSH and POP operations on a stack.

* Memory is to an assembly program what the disk is to a Ruby program: you pull things out of memory into variables, do things with them, and eventually put them back into memory.

* Control flow is done via GOTOs --- jumps, branches, or calls.

* A jump is just an unconditional GOTO.

* Most operations on registers, like addition and subtraction, have the side effect of altering status flags, like "the last value computed resulted in zero". There are just a few status flags, and they usually live in a special register.

* Branches are just GOTOs that are predicated on a status flag, like, "GOTO this address only if the last arithmetic operation resulted in zero".

* A CALL is just an unconditional GOTO that pushes the next address on the stack, so a RET instruction can later pop it off and keep going where the CALL left off.

Everything else is just a detail.
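
To make the sentences above concrete, here is a minimal Python sketch of a toy machine with registers, memory, a zero flag, jumps, a conditional branch, and CALL/RET. The instruction names (LOADI, JZ, and so on) and the tuple encoding are invented for this illustration and do not correspond to any real instruction set; register values are plain Python integers rather than fixed-size words.

```python
def run(program, memory=None, max_steps=10_000):
    """Run a toy program given as a list of tuples (hypothetical encoding).

    ("LOADI", r, n)  r = n            ("JMP", t)   unconditional GOTO
    ("LOAD", r, a)   r = memory[a]    ("JZ", t)    GOTO t if zero flag is set
    ("STORE", r, a)  memory[a] = r    ("CALL", t)  push return address, GOTO t
    ("ADD", d, s)    d += s, sets Z   ("RET",)     pop return address, GOTO it
    ("SUB", d, s)    d -= s, sets Z   ("HALT",)    stop
    """
    regs = {}                    # the "global variables"
    mem = dict(memory or {})     # everything that does not fit in registers
    stack = []                   # return addresses pushed by CALL
    zero = False                 # the only status flag in this sketch
    pc = 0                       # index of the next instruction
    for _ in range(max_steps):
        op, *args = program[pc]
        pc += 1
        if op == "LOADI":
            regs[args[0]] = args[1]
        elif op == "LOAD":
            regs[args[0]] = mem[args[1]]
        elif op == "STORE":
            mem[args[1]] = regs[args[0]]
        elif op in ("ADD", "SUB"):
            dst, src = args
            if op == "ADD":
                regs[dst] += regs[src]
            else:
                regs[dst] -= regs[src]
            zero = regs[dst] == 0        # side effect on the status flag
        elif op == "JMP":
            pc = args[0]
        elif op == "JZ":
            if zero:
                pc = args[0]
        elif op == "CALL":
            stack.append(pc)             # pc already points past the CALL
            pc = args[0]
        elif op == "RET":
            pc = stack.pop()
        elif op == "HALT":
            return regs, mem
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached")


# Sum n + (n-1) + ... + 1, where n is read from memory[0] (assumes n >= 1).
PROGRAM = [
    ("LOAD", "r0", 0),      # 0: r0 = n
    ("LOADI", "r1", 0),     # 1: r1 = running total
    ("LOADI", "r2", 1),     # 2: r2 = constant 1
    ("CALL", 6),            # 3: call the summing routine
    ("STORE", "r1", 1),     # 4: memory[1] = total
    ("HALT",),              # 5
    ("ADD", "r1", "r0"),    # 6: total += n
    ("SUB", "r0", "r2"),    # 7: n -= 1, sets the zero flag when n hits 0
    ("JZ", 10),             # 8: leave the loop once n is 0
    ("JMP", 6),             # 9: otherwise go around again
    ("RET",),               # 10: return to instruction 4
]

regs, mem = run(PROGRAM, memory={0: 5})
print(mem[1])  # 15
```

Real instruction sets differ in the details (operand counts, flag sets, addressing modes), but most of what shows up in a disassembly is a variation on these few moves.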

When working with assembly, a lot of people will just get the programmer's reference manual for the instruction set in a PDF (they're published for free). Here's a short one for X86:

http://ref.x86asm.net/coder32.html

Here's ARM:

http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001m/QR...

Here's AVR:

http://www.atmel.com/images/atmel-0856-avr-instruction-set-m...

Even if you'd never written a line of assembly, if I asked you to express the procedures of a simple Ruby program in assembly and gave you the instruction set reference and those sentences, you would figure out how to get the job done in a couple hours. No monads or linear algebra required!

Agreed about ease.

The first time I reversed anything was when I was playing a shareware game on my mom's Quadra 650 running Mac OS 7.6.1. I was maybe 12 or 13. I had reached the end of the game's limited demo, and it asked me to enter a code.

I discovered that a program called "Super ResEdit" would open up the game and show me its internal resources. Icons, text, menubars... and also a long column of lines that looked something like "ADD R1, R2, R3" and "CMP R2, R3" and "BNE +0x8".

Anything of the form "B ..." or "BEQ ..." or "BNE ...", when I moved the cursor over it, would develop an arrow pointing to a different line. Aha! "B" stands for "Branch". In that case, "BEQ" stands for "Branch if equal", "BNE" stands for "Branch if Not Equal". It must be that "CMP", which usually preceded one of these lines, stood for "Compare"...

After that it was a matter of finding a section called "do_registration_check", seeing a bunch of arithmetic ("ADD", "MULT"), then a "CMP" followed by "BNE". Apparently if I entered an incorrect code, the "Branch Not Equal" path would be followed. I didn't know about NOPs at the time, since the compiled code didn't have any, but I could change a "BNE" to a "BEQ".
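
The effect of swapping "BNE" for "BEQ" can be sketched with a toy model in Python. Everything here is hypothetical: the three-instruction routine, the register names, and the "accept"/"reject" labels are stand-ins for the real check, which would sit in the middle of much more code.

```python
def run_check(routine, entered, expected):
    """Interpret a tiny, made-up registration routine and return its outcome."""
    regs = {"R2": entered, "R3": expected}
    equal = False                       # result of the last CMP
    for op, *args in routine:
        if op == "CMP":
            equal = regs[args[0]] == regs[args[1]]
        elif op == "BNE" and not equal:
            return args[0]              # branch taken: operands differed
        elif op == "BEQ" and equal:
            return args[0]              # branch taken: operands matched
        elif op == "B":
            return args[0]              # unconditional branch
    raise RuntimeError("fell off the end of the routine")


ORIGINAL = [
    ("CMP", "R2", "R3"),
    ("BNE", "reject"),                  # wrong code: jump to the failure path
    ("B", "accept"),
]

# The one-instruction patch: every BNE becomes a BEQ.
PATCHED = [("BEQ", *ins[1:]) if ins[0] == "BNE" else ins for ins in ORIGINAL]

print(run_check(ORIGINAL, 1111, 4242))  # reject
print(run_check(PATCHED, 1111, 4242))   # accept
```

Note that the inverted branch now sends the correct code down the failure path, which is the trade-off of flipping the condition rather than removing the branch.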

Super big rush! Discovering, on my own, how to take something apart and bend it to my will.

Just a couple of annoying nitpicks:

> * Most operations on registers, like addition and subtraction, have the side effect of altering status flags, like "the last value computed resulted in zero". There are just a few status flags, and they usually live in a special register.

Note that this is a CISCism. Most RISC designs, like ARM, allow the compiler to specify whether condition codes are to be updated in order to more easily eliminate false dependencies.

> * A CALL is just an unconditional GOTO that pushes the next address on the stack, so a RET instruction can later pop it off and keep going where the CALL left off.

Also a CISCism. Most RISC designs have a "link register" that the return address goes into, and the stack push, if desired, has to be done manually. RET is, in this case, just an unconditional branch to the address in LR.

These aren't true of MSP430 or AVR, both of which are RISC-ish designs. Also, you still have to save the link register inside functions that call other functions. I also didn't explain SPARC register windows. :)
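
The difference between the two calling styles, and why the link register has to be saved before a nested call, can be sketched in Python. This is a simplified model with invented class and method names, and it assumes every instruction occupies exactly one address so that "next address" is just pc + 1.

```python
class StackCallMachine:
    """CALL pushes the return address; RET pops it."""

    def __init__(self, pc=0):
        self.pc = pc
        self.stack = []

    def call(self, target):
        self.stack.append(self.pc + 1)
        self.pc = target

    def ret(self):
        self.pc = self.stack.pop()


class LinkRegisterMachine:
    """The call instruction writes the return address into a link register."""

    def __init__(self, pc=0):
        self.pc = pc
        self.lr = None
        self.stack = []

    def bl(self, target):               # "branch and link"
        self.lr = self.pc + 1           # overwrites any earlier return address
        self.pc = target

    def ret(self):
        self.pc = self.lr               # just a branch to the address in LR

    def push_lr(self):
        self.stack.append(self.lr)

    def pop_lr(self):
        self.lr = self.stack.pop()


def nested_call(save_lr):
    """Call 100 from address 10, call 200 from inside it, then return twice."""
    m = LinkRegisterMachine(pc=10)
    m.bl(100)                           # outer call: lr = 11
    if save_lr:
        m.push_lr()                     # manual save of the return address
    m.bl(200)                           # nested call: lr = 101
    m.ret()                             # back inside the outer function
    if save_lr:
        m.pop_lr()                      # restore lr = 11
    m.ret()
    return m.pc


print(nested_call(save_lr=True))   # 11
print(nested_call(save_lr=False))  # 101
```

Without the save, the second return branches back into the outer function instead of to its caller, because the nested call overwrote the only copy of the return address.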

But hopefully these nitpicks aren't so much "annoying" as they are an example of how little you need to know to follow a discussion about super nitpicky details of different architectures. AVR has a general-purpose programmable condition flag 'T' in its status register. What do you do with the 'T' flag? I'unno, but hopefully the basic idea makes sense after not-too-much reading!

6502 assembly is surprisingly popular as a learning platform.

https://news.ycombinator.com/item?id=4213806

Hm. 12 sentences. Someone here can do it better in fewer sentences, I think.

Much of assembly programming is variants of these commands:

    - mov <dst>, <src>            | dst = src
    - sub <dst>, <src a>, <src b> | dst = a - b
    - jump <label>                | continue running at <label>
    - jump if equal <label>       | continue running at <label> if both sources of
                                    the last command were equal
    - call <label>                | save the current location, then continue
                                    running at <label>
    - ret                         | continue running after the previous CALL
                                    instruction

<dst> can be:

    - any memory address
    - any of 8-32 available temporary integer variables called "registers"

<src> can be any valid <dst> or a hardcoded integer known as an "immediate".

This is both more detailed than mine and uses fewer sentences. My only nit is that I feel like understanding status flags is really important.

I had several nitpicks with mine, but realized once the learner starts asking questions about gaps they're in a pretty good place to find an answer on their own.

I do agree flags are less obvious, but it also takes a bit of space to list what kind of flags one might see:

"jump if equal" is facilitated by an implicit "flags" register which is set after most arithmetic or comparison operations like "sub". Possible flags include:

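- carry: the last arithmetic operation carried out of (or borrowed into) the top bit, i.e. it overflowed as an unsigned value

- zero: result of the last operation was zero

- parity: on x86, the low byte of the last result has an even number of set bits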

- sign: result of the last operation was negative
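
As a worked example of the four flags listed above, here is a small Python sketch that adds two 8-bit values and reports the flags. The function name and the dictionary layout are made up for illustration, and the parity flag follows the x86 convention (set when the low byte of the result has an even number of 1 bits); other architectures may define or omit these flags differently.

```python
def add8(a, b):
    """Add two values as 8-bit unsigned integers and compute status flags."""
    full = (a & 0xFF) + (b & 0xFF)      # up to 9 bits wide
    result = full & 0xFF                # what fits in the 8-bit register
    flags = {
        "carry": full > 0xFF,                          # carried out of bit 7
        "zero": result == 0,
        "parity": bin(result).count("1") % 2 == 0,     # even number of 1 bits
        "sign": bool(result & 0x80),                   # top bit set: negative
    }
    return result, flags


print(add8(0xFF, 0x01))
# (0, {'carry': True, 'zero': True, 'parity': True, 'sign': False})
print(add8(0x7F, 0x01))
# (128, {'carry': False, 'zero': False, 'parity': False, 'sign': True})
```

The second call shows why sign and carry are separate: 0x7F + 1 does not carry out of the register, but the result reads as negative when interpreted as a signed byte.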

In the context of reverse engineering I feel like learning assembly this way is 'doing it wrong'. Of course, there's no such thing as 'doing it wrong' when it comes to learning.

What worked for me was going in the opposite direction: start with simple C programs [0], and see what compilers do with them. If you understand C, you already kind of understand how the machine works, though without the machine specifics. If you see an assembly instruction that you don't recognize, check the manual [1]. You can do this online these days, say, with [2]. Here's an example of a simple program that covers calls and branches: http://goo.gl/DKrYrE

Then, use an interactive debugger (like OllyDbg or whatever works for your platform) to trace through your small programs, instruction by instruction, and see how memory and registers get manipulated at each step. Change instructions and see what happens. This will also make you familiar with common compiler idioms, which is very useful in RE work. Once you get reasonably familiar with these small programs, try your hand at a program you _don't know_, or increase the complexity of your small programs. Rinse and repeat.

[0] The choice of C here is relevant, since many other compiled languages tend to add a lot of cruft to their binaries.

[1] http://www.felixcloutier.com/x86/

[2] http://gcc.godbolt.org/

I would generally agree that writing/manipulating assembly language is a better learning device than black-box reversing (I mentioned reversing because of the context of the article).

But learning to reverse assembly language is also one of those daunting tasks that turns out not to live up to its scary reputation. I wouldn't want to suggest that you can't just dive in and learn to reverse if reversing is your actual goal.

My bad; I somehow skipped over your paragraph saying to 'dive in', which renders all of my disagreement invalid.

No, your disagreement is awesome stuff. I'm glad I got you to write it down.

> You're given 8-32 global variables of fixed size to work with, called "registers".

Luxury!

I think this is better than the super terse one downthread. Just saying statuses are in a register (as a simplification) and mentioning the PC might be a way to make it shorter and reinforce the theme that instructions change the internal state of the cpu which is also kept in special 'registers'.

You're right; this might be a better explanation if it somehow captured simple stack machines.

> Some terms (opcode, ISA) are not really explained (except in the glossary) before they are used, and there is perhaps too much detail at the expense of the bigger picture.

I feel like if you want to get into serious reverse engineering (beginner or not), you should have a basic understanding of computers. Both of these terms are extremely basic, from that perspective.

Opcodes are the numeric encodings that assembly instructions correspond to. ISA (instruction set architecture) just refers to the architecture of the CPU (x86 vs. ARM), i.e. WHICH opcodes/asm apply.
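
To illustrate the opcode-to-mnemonic relationship, here is a minimal Python sketch that decodes a handful of one-byte 32-bit x86 opcodes. The function and table names are invented for this example, and it is nowhere near a real disassembler: x86 instructions are variable-length, so any byte it does not recognize is simply printed as data.

```python
# Register numbering used by the one-byte PUSH/POP encodings on 32-bit x86.
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# A few opcodes that are a complete instruction in a single byte.
FIXED = {0x90: "nop", 0xC3: "ret", 0xCC: "int3", 0xF4: "hlt"}


def decode(code):
    """Map each byte to a mnemonic, assuming only one-byte instructions."""
    out = []
    for byte in code:
        if byte in FIXED:
            out.append(FIXED[byte])
        elif 0x50 <= byte <= 0x57:
            out.append(f"push {REGS32[byte - 0x50]}")   # 0x50 + register number
        elif 0x58 <= byte <= 0x5F:
            out.append(f"pop {REGS32[byte - 0x58]}")    # 0x58 + register number
        else:
            out.append(f"db 0x{byte:02x}")              # not handled by this sketch
    return out


print(decode(bytes([0x55, 0x90, 0x5D, 0xC3])))
# ['push ebp', 'nop', 'pop ebp', 'ret']
```

The same bytes would mean something else entirely on ARM or AVR, which is what "the ISA determines which opcodes apply" comes down to in practice.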

I agree. What I meant was that the title "Reverse Engineering for Beginners" might imply that this material would guide the reader from the beginning. Having a background in 6502 assembly (thank you, VIC-20!), I didn't have that much trouble following the presentation in the book. But if you come from a higher-level programming background and lack a basic grasp of how CPUs work, following the text might be difficult.

Do you have any recommendations for beginners?

I don't remember exactly how I got started, probably by trying to replicate what bigger kids were doing with "intros" and demos. I don't know what would be a simple enough starting point nowadays. But I remember William Stallings' and Andrew Tanenbaum's books on operating systems and computer architecture from university 20 years ago. I don't know if they have been updated, but they might be a good starting point if you need to start from the beginning.

On the other hand, if you know the basics of how a CPU works, what a stack is, binary and hexadecimal, and some C, you might be able to just read Dennis' book. Just ignore things you don't get at first, since many things get explained later in the book, after which you can re-read the sections you didn't quite understand the first time.

I've personally been slacking quite a bit in this area. I'll do my best to get more advanced, thanks for the info!