| HN Mirror



	by tptacek 4092 days ago
	Write a disassembler at least once. It's much easier than you think it is (even with X86) --- it's essentially a file format parser --- and very illuminating.

4 comments

userbinator 4091 days ago

It's also much easier to decode x86 instructions when you look at them in octal instead of the hexadecimal that most tables use, since both the main opcode map and ModRM/SIB are organised in a 2-3-3 layout:

http://reocities.com/SiliconValley/heights/7052/opcode.txt

The 8080/8085/Z80 instruction sets also look much better in octal:

http://www.z80.info/decoding.htm
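To make the 2-3-3 point concrete, here is a minimal Python sketch (not taken from either linked document). `split_233` and `describe_opcode` are hypothetical helper names, the register names assume 32-bit mode, and only a few one-byte opcode families are covered:

```python
# Sketch: read an x86 byte as three octal digits (2 + 3 + 3 bits).
ALU_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_233(byte):
    """Return the (top 2 bits, middle 3 bits, low 3 bits) of a byte."""
    return (byte >> 6) & 0o3, (byte >> 3) & 0o7, byte & 0o7


def describe_opcode(opcode):
    """Name a few one-byte opcode families from their octal digits.

    Hypothetical helper for illustration; it is far from a complete map.
    """
    hi, mid, lo = split_233(opcode)
    if hi == 0 and lo < 6:
        return ALU_OPS[mid]                    # 0o0X0..0o0X5: X picks the ALU op
    if (hi, mid) == (1, 2):
        return f"push {REGS32[lo]}"            # 0o12r
    if (hi, mid) == (1, 3):
        return f"pop {REGS32[lo]}"             # 0o13r
    if (hi, mid) == (2, 7):
        return f"mov {REGS32[lo]}, imm32"      # 0o27r
    return "?"


# ModRM byte 0xD8 is 0o330: mod=3 (register), reg=3 (ebx), rm=0 (eax).
print(format(0xD8, "03o"), split_233(0xD8))
print(describe_opcode(0x29), "|", describe_opcode(0x53), "|", describe_opcode(0xB9))
```

In hex, 0x29, 0x53 and 0xB9 look unrelated; in octal (051, 123, 271) the middle digit selects the operation or family and the last digit selects the form or register.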


101914 4091 days ago

This comment validates all the time I have "wasted" reading HN over the years.

It does not surprise me that something so simple would be so well overlooked (or, at least, "forgotten"). I wonder if I ever would have figured this out from my own readings and experiments. Doubtful.

Great tip!


userbinator 4091 days ago

I figured it out before/without exposure to that document, but I attribute it to the fact that I started teaching myself at a time when octal was more common amongst mini and micro-computers; most programmers these days barely know any number base other than decimal, and of those who do, binary and hexadecimal are likely far more familiar to them than octal. The official Intel/AMD manuals make no reference to octal either, using only binary and hex.

As an aside, ARM opcodes are (mostly) hex-structured with 4-bit fields, while MIPS, POWER, and SPARC are not amenable to any standard number base except binary (5- and 6-bit fields.)
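A small Python sketch of the field-width point, for illustration only. `fields` is a hypothetical helper, and the two example words are the standard encodings of ARM `ADD r0, r1, r2` and MIPS `add $t0, $t1, $t2`:

```python
def fields(word, widths, total_bits=32):
    """Split a word into bit fields of the given widths, most significant first."""
    out = []
    shift = total_bits
    for width in widths:
        shift -= width
        out.append((word >> shift) & ((1 << width) - 1))
    return out


# ARM (A32) ADD r0, r1, r2: the condition and register fields sit on hex digits.
arm = 0xE0810002
print([format(f, "x") for f in fields(arm, [4] * 8)])

# MIPS add $t0, $t1, $t2: R-type layout is opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6),
# so the register numbers 9, 10 and 8 straddle the hex digits of 0x012A4020.
mips = 0x012A4020
print(fields(mips, [6, 5, 5, 5, 5, 6]))
```

For the ARM word the nibbles read directly as cond=e, Rn=1, Rd=0, Rm=2, while the MIPS fields only appear after explicit shifting and masking.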


tptacek 4091 days ago

I had no idea! This is very cool.


tptacek 4091 days ago

This is seriously exploding my brain. Thank you for posting it.


peterfirefly 4091 days ago

And that's really interesting (and puzzling).

You are clearly intelligent, well educated, etc. and blah blah blah.

Yet, you somehow missed that.

If you can explain why, it would probably be something we could all learn from. At the very least, it would be interesting.

Is it because you learned x86 before you learned octal and never really reexamined the encoding?

(The PDP-11 machine code also looks best in octal, btw.)
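For the PDP-11 aside, a tiny Python sketch; `pdp11_digits` is a hypothetical helper and the only instruction shown is the double-operand `MOV R0, R1`, whose word is 010001 in octal:

```python
def pdp11_digits(word):
    """Return the six octal digits of a 16-bit word, most significant first."""
    return [int(d) for d in format(word & 0xFFFF, "06o")]


# Double-operand layout, one octal digit per field:
# byte flag, opcode, source mode, source register, destination mode, destination register.
byte_flag, opcode, src_mode, src_reg, dst_mode, dst_reg = pdp11_digits(0o010001)
print(opcode, (src_mode, src_reg), (dst_mode, dst_reg))  # opcode 1 is MOV; mode 0 is register direct
```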


danieljh 4092 days ago

As someone who has recently been looking more into x86_64 assembly, instruction encoding, syscalls and ELF files, I find both the lack of good starting points and the amount of work required to get into it a pity.

I'm currently using [0] as a helping hand among other resources which is quite good; my maybe not-so-interesting results are at [1].

For example, to get a good overview of instruction encoding you have to 0/ read through and ditch horrible blog posts 1/ find the correct Intel manual 2/ search and read through thousands of PDF pages until you find something interesting 3/ understand the environment and facts that are either implicitly given or in the documents but not easy to find.

For the handful of lines of actual code I wrote yesterday [1] I still have around 25 tabs open. Complexity and no end in sight.

Do you have any recommendations and hints as to where to start with this in the year 2015?

[0] http://0xax.blogspot.se/p/assembly-x8664-programming-for-lin...

[1] https://github.com/daniel-j-h/x86-64-asm


tibiapejagala 4092 days ago

Seconding this. I've implemented part of the x86 instruction encoding, and the resources you find are either:

* comprehensible, but far from complete (some blogs)

* complete, but hard to understand and requiring some implicit knowledge (Intel manual or [1])

Rather than a disassembler, I recommend writing a simple JIT compiler, with [2] as a starting point. You skip some problems this way.

[1] http://ref.x86asm.net/ this seems pretty cool as a reference, but I can't wrap my head around it

[2] http://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introd...


SCHiM 4092 days ago

I use that first reference extensively.

But you have to understand that it's just a reference; it doesn't give you the complete picture. It just shows you the important stuff when you already know where to look.

I've written partial disassemblers/assemblers. And that site has been a huge help to me.

My 2 cents:

Start with being able to decode the mov instruction, with all the different possible memory encodings. Once you understand how to parse the memory/addressing scheme of x86, it's suddenly a whole lot easier. And I agree that writing an assembler to start is probably easier: a disassembler has to be complete, but an assembler doesn't have to support all instructions to work.
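As a rough illustration of that starting point, here is a minimal Python sketch that decodes only the two register/memory `mov` opcodes 0x89 and 0x8B. It assumes 32-bit operand and address size and ignores prefixes and every other opcode; `decode_modrm` and `decode_mov` are hypothetical names, not from any existing tool:

```python
import struct

REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_modrm(code, pos):
    """Decode ModRM (+ SIB, + displacement) at code[pos].

    Returns (reg field, r/m operand text, position after the operand bytes).
    """
    modrm = code[pos]
    pos += 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:                       # register-direct operand
        return reg, REGS32[rm], pos

    parts = []
    disp_size = (0, 1, 4)[mod]         # mod 0: none, 1: disp8, 2: disp32
    if rm == 4:                        # a SIB byte follows
        sib = code[pos]
        pos += 1
        scale, index, base = sib >> 6, (sib >> 3) & 7, sib & 7
        if base == 5 and mod == 0:     # no base register, disp32 instead
            disp_size = 4
        else:
            parts.append(REGS32[base])
        if index != 4:                 # index field 4 means "no index"
            parts.append(f"{REGS32[index]}*{1 << scale}")
    elif rm == 5 and mod == 0:         # absolute disp32, no register
        disp_size = 4
    else:
        parts.append(REGS32[rm])

    text = "+".join(parts)
    if disp_size:
        fmt = "<b" if disp_size == 1 else "<i"   # little-endian, signed
        disp = struct.unpack_from(fmt, code, pos)[0]
        pos += disp_size
        text += format(disp, "+#x") if parts else format(disp & 0xFFFFFFFF, "#x")
    return reg, f"[{text}]", pos


def decode_mov(code):
    """Decode one `mov r/m32, r32` (0x89) or `mov r32, r/m32` (0x8B)."""
    opcode = code[0]
    if opcode not in (0x89, 0x8B):
        raise ValueError("only opcodes 0x89 and 0x8B are handled in this sketch")
    reg, rm_text, length = decode_modrm(code, 1)
    reg_text = REGS32[reg]
    # Bit 1 of the opcode is the direction: set means the reg field is the destination.
    dst, src = (reg_text, rm_text) if opcode & 0b10 else (rm_text, reg_text)
    return f"mov {dst}, {src}", length


for hexbytes in ("89d8", "8b4508", "8b048b", "8b0d78563412"):
    print(hexbytes, "->", decode_mov(bytes.fromhex(hexbytes)))
```

Once `decode_modrm` exists, most other one-byte opcodes that take a ModRM operand can reuse it unchanged, which is why starting with mov pays off.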


jonlox 4092 days ago

I wrote a pretty complete assembler a few years back. My advice: if you want to truly learn encoding, you need to write an assembler. The reason is that, as you're trying to figure out whether your assembler is generating the correct instructions, you're going to be looking at its output in hexdump format for days or weeks. Pretty soon you're going to notice prefixes, and will be able to visually decode instructions just by looking at them in hex bytes. It's really not that hard after a little practice, and knowing ModRM.

I will emphasize, the Intel manual is pretty much all I used. Along with NASM. I looked at NASM source a lot to figure out what they did, but also used NASM to compare generated instructions. The Intel manual is critical. I would go straight to the authoritative source. It's not hard to follow once you understand the terminology and format a bit. Just keep reading it.

edit: Oh, and the most important thing to ever know: Intel is little-endian! I cannot stress the importance of understanding this enough. Even when you know this, it's very easy to forget it when looking at code.
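A short Python sketch of what little-endian means when reading a hexdump; `read_imm32` is a hypothetical helper, and the example bytes are the encoding of `mov eax, 0x12345678` (opcode 0xB8 followed by a 32-bit immediate):

```python
def read_imm32(code, offset):
    """Read a 32-bit little-endian immediate starting at code[offset]."""
    return int.from_bytes(code[offset:offset + 4], "little")


code = bytes.fromhex("b8 78 56 34 12")   # mov eax, 0x12345678
print(hex(read_imm32(code, 1)))           # the bytes appear reversed in the dump

# Going the other way, as an assembler would when emitting the immediate:
print((0x12345678).to_bytes(4, "little").hex(" "))
```

The least significant byte comes first in memory, so the dump shows `78 56 34 12` for the value 0x12345678.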


Mahn 4092 days ago

And then you realize you could actually take it a step further, and before you know it you are hunting for documentation on the number of cycles each instruction and data access takes and other obscure clocking info, never mind that there are a thousand and one emulators already in existence that are several orders of magnitude better than the half-assed attempt you are trying to come up with. Fun times :)


tptacek 4092 days ago

That's such a huge problem with exploratory programming: the demotivating effect of knowing that there are better versions of almost anything you're building, because other people have been working on the problem for longer.

You have to get over it and keep going anyways or you'll never become one of those people yourself! (Or even be able to make an informed decision about whether you want to).

Not coincidentally: part of the point of the company we just started. :)


goalieca 4092 days ago

Cycle accurate emulation of modern x86 would be impossible. Intel simply won't release the secret sauce on branch prediction, cache prediction, out of order execution, internal instruction engine, etc.


im2w1l 4092 days ago

On the other hand, the absence of such information means that games don't depend on it.


mschuster91 4092 days ago

I'm currently writing one in JavaScript of all languages, just to prove it works. Also, it's truly multi-platform ;)


ianopolous 4091 days ago

You may find this disassembler I ported to Java useful: [JayD](http://github.com/ianopolous/JayD). It's more accurate than both objdump and its source, udis86, after I used it as the base for the [JPC emulator](http://github.com/ianopolous/JPC).
