Hacker News
by lifthrasiir 900 days ago
While I don't have a handy link, I have reverse-engineered several file formats without any documentation, and I can offer some pointers.

First, make sure that you know what the format is actually supposed to encode. For example, if some file weighs (say) 40 KB then it is unlikely to be a raster image. The file name, if any, helps a lot to narrow the scope.

Second, you should have some understanding of similar file formats. I generally recommend studying PNG first because it is a good example of both a typical structured file format and a raster image format. (Don't delve into the compression, though; bitwise analysis is much harder.) This is also why you need to know what the format is for: formats with the same goal tend to have similar structures.
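To make the PNG suggestion concrete, here is a minimal sketch of the chunk layout defined in the PNG specification: an 8-byte signature, then repeated chunks of a big-endian 4-byte length, a 4-byte type, the payload, and a CRC-32 over type plus payload. The helper names (make_chunk, iter_chunks) and the tiny in-memory sample are my own, only there so the snippet runs without a file on disk.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(kind, payload):
    """Build one chunk: length, type, payload, CRC-32 over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)

def iter_chunks(data):
    """Yield (type, payload, crc_ok) for each chunk after the signature."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    pos = 8
    while pos < len(data):
        # Every chunk starts with a big-endian length and a 4-byte type tag.
        length, kind = struct.unpack_from(">I4s", data, pos)
        payload = data[pos + 8 : pos + 8 + length]
        (crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        crc_ok = crc == (zlib.crc32(kind + payload) & 0xFFFFFFFF)
        yield kind, payload, crc_ok
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC

# Illustrative 1x1 grayscale sample built in memory.
# IHDR fields: width, height, bit depth, color type, compression, filter, interlace.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # one scanline: filter byte + one pixel
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", ihdr)
    + make_chunk(b"IDAT", idat)
    + make_chunk(b"IEND", b"")
)

chunks = list(iter_chunks(sample))
for kind, payload, crc_ok in chunks:
    print(kind.decode("ascii"), len(payload), "crc ok" if crc_ok else "crc BAD")
```

The length/tag/payload/checksum pattern is worth recognizing because many unrelated formats use some variation of it.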

Third, collect as many examples as possible. You can line them up to see commonalities and differences and spot patterns. Even better if you can actively generate different files. This is generally the last hope when you have run out of reasonable hypotheses.
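Lining files up can be as simple as checking, offset by offset, which bytes never change across samples. A rough sketch, assuming the samples share a fixed-layout header; the three sample blobs are made up for illustration (a 4-byte magic, a 2-byte field, then a payload), and constant_runs is a hypothetical helper name.

```python
def constant_runs(samples):
    """Return half-open (start, end) offset ranges where all samples agree.

    Only offsets up to the length of the shortest sample are compared.
    """
    shortest = min(len(s) for s in samples)
    runs = []
    start = None
    for off in range(shortest):
        same = len({s[off] for s in samples}) == 1
        if same and start is None:
            start = off  # a constant run begins here
        elif not same and start is not None:
            runs.append((start, off))  # the run ended just before this offset
            start = None
    if start is not None:
        runs.append((start, shortest))
    return runs

# Made-up samples of an imaginary format, not taken from any real file.
samples = [
    b"FMT1\x03\x00abc",
    b"FMT1\x05\x00hello",
    b"FMT1\x02\x00hi",
]

runs = constant_runs(samples)
for start, end in runs:
    print(f"constant at [{start}, {end}): {samples[0][start:end]!r}")
```

Constant runs are candidates for magic numbers, version fields, or padding; the offsets that vary are where the interesting fields live.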

Fourth, optimize the feedback loop. You will have to do a lot of hypothesizing, validation and automation. You can't really reduce the number of iterations, but you can reduce the time a single iteration takes. Use a comfortable scripting language with good support for binary data. I tend to use plain Python with the struct module and write everything else myself, but there are several libraries that help a lot if you don't feel like doing so.
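As an illustration of a fast iteration with plain Python and struct, the sketch below tests one hypothesis ("the integer at offset 4 is the payload length") against every sample at once, for two candidate encodings. The sample blobs, the 6-byte header size, and the name check_length_field are all assumptions made up for this example.

```python
import struct

def check_length_field(blob, offset, fmt, header_size):
    """Hypothesis: the integer at `offset` equals the byte count after the header."""
    (value,) = struct.unpack_from(fmt, blob, offset)
    return value == len(blob) - header_size

# Made-up samples of an imaginary format: 4-byte magic, 2-byte field, payload.
samples = [
    b"FMT1\x03\x00abc",
    b"FMT1\x05\x00hello",
    b"FMT1\x02\x00hi",
]

# Candidate interpretations of the 2 bytes at offset 4.
candidates = {
    "u16 little-endian": "<H",
    "u16 big-endian": ">H",
}

# A hypothesis survives only if it holds for every sample.
results = {
    name: all(check_length_field(s, 4, fmt, 6) for s in samples)
    for name, fmt in candidates.items()
}

for name, ok in results.items():
    print(f"{name}: {'holds' if ok else 'rejected'}")
```

Keeping each hypothesis as a small function that runs over the whole sample set makes it cheap to throw away wrong guesses and try the next one.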