by SuperNinKenDo 901 days ago
This is something I've been interested in for a while.

I've collected a few links people have already posted to their own projects or write-ups here and elsewhere, but is there any single excellent resource for learning how to do this?

I've a number of dead and/or proprietary formats that I've always wanted to crack open, but I'm totally overwhelmed with where to start.

2 comments

While I don't have a handy link, I have reverse-engineered several file formats with no further information, and I can offer some pointers.

First, make sure that you know what the format is actually supposed to encode. For example, if some file weighs (say) 40 KB then it is unlikely to be a raster image. The file name, if any, helps a lot to narrow the scope.

Second, you should have some understanding of similar file formats. I generally recommend studying PNG first because it is a good example of both a typical structured file format and a raster image format. (Don't delve into the compression, though; bitwise analysis is much harder.) This is also why you need to know what the format is for: formats with the same goal tend to have similar structures.
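To make the PNG suggestion concrete, here is a minimal sketch of a chunk walker. It relies only on the documented PNG layout (an 8-byte signature, then chunks of big-endian length, 4-byte type, payload, and a CRC-32 over type plus payload). The function name iter_png_chunks is my own, and the sketch does no decompression or pixel decoding.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def iter_png_chunks(data):
    """Yield (type, payload, crc_ok) for each chunk in a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk starts with a 4-byte big-endian length and a 4-byte type.
        length, ctype = struct.unpack_from(">I4s", data, pos)
        payload = data[pos + 8 : pos + 8 + length]
        # The CRC-32 follows the payload and covers the type and the payload.
        (crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        crc_ok = zlib.crc32(ctype + payload) == crc
        yield ctype.decode("ascii"), payload, crc_ok
        pos += 12 + length  # length field + type + payload + CRC

# Usage (the file name is hypothetical):
#   with open("sample.png", "rb") as f:
#       for ctype, payload, ok in iter_png_chunks(f.read()):
#           print(ctype, len(payload), ok)
```

The length/type/payload/checksum shape shows up in many other container formats, which is why it is worth recognizing early.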

Third, collect as many examples as possible. You can line them up to see commonalities and differences and spot patterns. Even better if you can actively generate different files. This is generally the last hope when you have run out of reasonable hypotheses.
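One way to "line them up" is to check, offset by offset, which bytes are identical across all samples and which vary. The sketch below assumes the interesting structure sits at fixed offsets near the start of the file, which is often but not always true. The helper names are my own.

```python
def classify_offsets(samples):
    """Split byte offsets into (constant, varying) over the shortest sample's length."""
    n = min(len(s) for s in samples)
    constant, varying = [], []
    for i in range(n):
        values = {s[i] for s in samples}
        (constant if len(values) == 1 else varying).append(i)
    return constant, varying

def to_ranges(offsets):
    """Collapse a sorted list of offsets into inclusive (start, end) ranges."""
    ranges = []
    for off in offsets:
        if ranges and off == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], off)
        else:
            ranges.append((off, off))
    return ranges
```

Constant ranges are candidates for magic numbers, version fields, and padding. Varying ranges are candidates for sizes, counts, timestamps, and actual data.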

Fourth, optimize the feedback loop. You will have to do a lot of hypothesizing, validation and automation. You can't really reduce the number of iterations, but you can reduce the time a single iteration takes. Use a comfortable scripting language with good support for binary data. I tend to use vanilla Python with struct and write everything else myself, but there are several libraries that can help a lot if you don't feel like doing so.
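As an illustration of a fast hypothesize-and-validate iteration with plain struct, the sketch below tests one common hypothesis against every sample at once: "somewhere in the first bytes there is a 32-bit field that stores the file size minus a fixed amount". The hypothesis, the function name find_length_fields, and the 64-byte search window are assumptions for the example, not something every format satisfies.

```python
import struct

def find_length_fields(samples, max_offset=64):
    """Return (offset, fmt, delta) where a u32 at offset equals len(file) - delta in all samples."""
    if len({len(s) for s in samples}) < 2:
        # With equally sized samples every constant field would match.
        raise ValueError("need samples of at least two different sizes")
    hits = []
    # Only scan offsets where a full 4-byte read fits in every sample.
    limit = min(max_offset, min(len(s) for s in samples) - 3)
    for offset in range(max(limit, 0)):
        for fmt in ("<I", ">I"):  # little-endian and big-endian unsigned 32-bit
            deltas = {len(d) - struct.unpack_from(fmt, d, offset)[0] for d in samples}
            if len(deltas) == 1:
                hits.append((offset, fmt, deltas.pop()))
    return hits
```

A hit such as (4, "<I", 8) would suggest a little-endian size field at offset 4 that excludes an 8-byte header. The more samples of different sizes you feed in, the less likely a hit is a coincidence.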

I have reverse-engineered some ASCII file formats. While probably overkill, my background in parsing simple programming languages (for which there are many good educational resources) was really helpful (in the approach I use). I tokenize, try to figure out the syntax structures from the order of token types, and from there extract the information I need into my program's data representation. I'm not sure if this is the approach used by everyone else, but it seems plausible for someone with a CS/PL implementation background.
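A rough sketch of the tokenize-then-look-at-token-order idea, using the regex tokenizer pattern from the Python re documentation. The token classes below are guesses for a generic key/value style text format and would need adjusting for a real one. The names TOKEN_SPEC, type_signature and signature_counts are my own.

```python
import re
from collections import Counter

# Hypothetical token classes; order matters because earlier patterns win.
TOKEN_SPEC = [
    ("NUMBER", r"[-+]?\d+(?:\.\d+)?"),
    ("STRING", r'"[^"\n]*"'),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PUNCT", r"[{}()\[\]=,;:]"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("UNKNOWN", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, lexeme) pairs, dropping horizontal whitespace."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

def type_signature(line):
    """Reduce a line to the sequence of its token kinds."""
    return " ".join(kind for kind, _ in tokenize(line))

def signature_counts(text):
    """Count how often each token-kind sequence occurs across the lines of a file."""
    return Counter(type_signature(line) for line in text.splitlines())
```

Signatures that occur many times are good candidates for grammar rules. Rare ones, and anything containing UNKNOWN, point at parts of the format the token classes do not cover yet.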

But first, it helps to have sample files to see recurring structures. Ideally, you also have access to software that generates these files. This allows you to deal with simpler files containing less information to reason about, make small changes within the program and compare the corresponding change(s) in the file.
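For the "make a small change and compare" step, a sketch using difflib.SequenceMatcher from the standard library, which also copes with insertions that shift later bytes. It works on bytes as well as text, but it is slow on large inputs, so this assumes small files. The function name diff_regions is my own.

```python
from difflib import SequenceMatcher

def diff_regions(before, after):
    """Return (tag, before_start, before_bytes, after_start, after_bytes) for each changed region."""
    # autojunk=False disables the heuristic that ignores very frequent elements.
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [
        (tag, i1, before[i1:i2], j1, after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Save a file, change one setting in the generating program, save again, and run both versions through this. If the only differences are the edited value plus one other small field, that other field is likely a length, count, or checksum tied to the value.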