Hacker News
by theincredulousk 1485 days ago
I had a systematic way of doing this when reverse engineering a large undocumented system. First, keep first principles in mind: there is control flow (execution), data (files, databases), and communication interfaces (IPC, network, etc.). You don't need a sophisticated modeling tool; anything Visio-esque will do (Gliffy, draw.io, etc.).

For each of these there are systematic ways of finding them. Mine was a C/C++ project, so:

1. Find all executables via build output, or in the running system. For now you're largely going to ignore the details of what the code is doing. You just want to know what is actually "running" at runtime :).
2. Figure out where the entry points to those executables are, like a main. These are usually easy to search for or discover by convention.
3. Find out what threads each one spawns.
4. Start a simple diagram with a box at the top named for the executable, and branch down to one box for each thread. Manually trace control flow for each thread, adding boxes at points you think are noteworthy logical units. E.g. threads often have some kind of main loop they sit in, which is a key element for understanding what that thread is doing.
5. Continue for nested threads and worker (short-lived but not ephemeral) threads.
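
Steps 2 and 3 are mostly a text search, so they can be roughed out with a script. What follows is a minimal sketch, not what I actually ran: the function names (`scan_source`, `scan_tree`) and the regex patterns are assumptions for illustration, and a real code base will likely need more patterns (wrapper APIs, `WinMain`, custom thread classes).

```python
import re
from pathlib import Path

# Heuristic patterns only (assumed for illustration); extend for your code base.
ENTRY_POINT = re.compile(r"\bint\s+main\s*\(")
THREAD_SPAWN = re.compile(r"\b(pthread_create|std::thread|std::async|CreateThread)\b")


def scan_source(name, text):
    """Return (kind, file, line_number, stripped_line) hits for one source file."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ENTRY_POINT.search(line):
            hits.append(("entry", name, lineno, line.strip()))
        for _ in THREAD_SPAWN.finditer(line):
            hits.append(("thread", name, lineno, line.strip()))
    return hits


def scan_tree(root, suffixes=(".c", ".cc", ".cpp", ".h", ".hpp")):
    """Run scan_source over every C/C++ file below root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend(scan_source(str(path), text))
    return hits


if __name__ == "__main__":
    sample = (
        "#include <pthread.h>\n"
        "int main(int argc, char **argv) {\n"
        "    pthread_create(&rx, NULL, rx_loop, NULL);\n"
        "    std::thread logger(log_loop);\n"
        "}\n"
    )
    for hit in scan_source("daemon.cpp", sample):
        print(hit)
```

The hits are only starting points for the manual tracing in step 4; the script says where threads are created, not what they do.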

Once you complete this, you should have an abstract block diagram that gives a decent map of "what code is running in the system". And just through the process of naming the boxes and looking them over, you will probably get a rough idea of what the various pieces of software are doing and possibly how they relate.
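
If you would rather keep the diagram as text than draw it by hand, the same executable-to-thread-to-unit structure can be written out as Graphviz DOT. This is an optional sketch under my own assumptions: the `to_dot` helper, the nested-dict input shape, and the example names (`sensord`, `rx`, and so on) are all hypothetical.

```python
def to_dot(executables):
    """Turn {exe: {thread: [noteworthy unit, ...]}} into Graphviz DOT text."""
    lines = ["digraph execution {", "  node [shape=box];"]
    for exe, threads in executables.items():
        for thread, steps in threads.items():
            # One box per thread, branching down from the executable.
            prev = f"{exe}/{thread}"
            lines.append(f'  "{exe}" -> "{prev}";')
            # Chain the noteworthy logical units in the order they were traced.
            for step in steps:
                node = f"{exe}/{thread}/{step}"
                lines.append(f'  "{prev}" -> "{node}";')
                prev = node
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical system: one executable with two threads.
    print(to_dot({
        "sensord": {
            "main": ["parse config", "main loop"],
            "rx": ["read socket", "dispatch message"],
        }
    }))
```

The output can be pasted into any DOT renderer; the node names are prefixed with the executable and thread so that two threads with a "main loop" box do not collapse into one node.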

You can then repeat this for the other basics in a similar fashion - data and communication interfaces. It's good to emphasize staying at a first-principles kind of abstract mindset. You know there are a finite number of ways a process or thread can communicate or create side-effects outside of itself. If you literally just find all of them (not the details of what is happening over those interfaces), usually it ends up being quite few, and all of a sudden the complexity becomes less intimidating. You have a little box that does some manipulation of data via logic and state, and it goes in one pipe and out the other.
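
Because the set of ways out of a process is finite, the inventory can also start as a search. A rough sketch, assuming a POSIX-style C/C++ code base: the `INTERFACE_CALLS` table and the `inventory` function are my own illustrative names, and the call list is deliberately incomplete.

```python
import re
from collections import defaultdict

# Assumed, incomplete list of calls that cross the process boundary.
INTERFACE_CALLS = {
    "network": ["socket", "bind", "connect", "accept"],
    "ipc": ["pipe", "mkfifo", "mq_open", "shm_open", "msgget"],
    "files": ["fopen", "open", "sqlite3_open"],
}


def inventory(name, text):
    """Group interface-looking calls in one file by category.

    Returns {category: [(file, line_number, call_name), ...]}.
    """
    found = defaultdict(list)
    for category, calls in INTERFACE_CALLS.items():
        # The lookbehind skips member and namespaced calls such as
        # log.open(...), p->open(...) or std::bind(...). It also skips
        # ::socket(...), which is a known gap in this heuristic.
        pattern = re.compile(
            r"(?<![\w.>:])(" + "|".join(map(re.escape, calls)) + r")\s*\("
        )
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                found[category].append((name, lineno, match.group(1)))
    return dict(found)


if __name__ == "__main__":
    sample = (
        "int fd = socket(AF_INET, SOCK_STREAM, 0);\n"
        'mqd_t q = mq_open("/cmd", O_RDONLY);\n'
        'log.open("x");\n'
    )
    print(inventory("net.c", sample))
```

This only lists where the pipes are, which is the point: what flows over them comes later.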

I should point out that how difficult all this is depends largely on those "coding practices" that get droned on about for maintainability but so often get tossed. For example, say your system uses message IDs as part of an IPC mechanism. If the code followed good practice, using some kind of constant definitions shared from a single place, you can do things like search for that message identifier and find all places it's sent/received. If some code used its own re-definition of the same ID, or hardcoded just the raw numerical value, this becomes nearly impossible.
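
To make the message ID point concrete, here is a small sketch comparing the two searches. Everything in it is assumed for illustration: the `MSG_STATUS` constant, its value, and the helper names. The literal matcher is a heuristic that ignores suffixed literals such as `42u`.

```python
import re

INT_LITERAL = re.compile(r"\b(0[xX][0-9a-fA-F]+|\d+)\b")


def c_int(token):
    """Parse a C-style integer literal (hex, octal, decimal); None if invalid."""
    try:
        if token.lower().startswith("0x"):
            return int(token, 16)
        if token.startswith("0") and len(token) > 1:
            return int(token, 8)
        return int(token, 10)
    except ValueError:
        return None


def find_symbol_uses(text, symbol):
    """Line numbers that mention the named constant."""
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    return [n for n, line in enumerate(text.splitlines(), 1) if pattern.search(line)]


def find_literal_uses(text, value):
    """Line numbers containing any integer literal equal to value."""
    return [
        n
        for n, line in enumerate(text.splitlines(), 1)
        if any(c_int(tok) == value for tok in INT_LITERAL.findall(line))
    ]


if __name__ == "__main__":
    sample = (
        "#define MSG_STATUS 0x2A\n"
        "send_msg(q, MSG_STATUS, &s);\n"
        "if (hdr.id == 42) handle_status(&hdr);\n"
        "char buf[42];\n"
    )
    print(find_symbol_uses(sample, "MSG_STATUS"))  # [1, 2]
    print(find_literal_uses(sample, 42))           # [1, 3, 4]
```

The symbol search is precise but misses the hardcoded comparison on line 3; the literal search finds it, along with an unrelated buffer size on line 4. Across a large code base that noise is what makes raw values nearly impossible to trace.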

Also, you'll need multiple diagrams. You won't be able to clearly show a complete "code execution" diagram at the same time as an interface relationship diagram or a shared data sources diagram. Combining them will not help; it just adds more overwhelming complexity.