| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nephrenka 2672 days ago

A large codebase under active development presents a moving target; Even if you knew how something worked last week, that code might have changed twice since then. Detailed knowledge in the solution domain gets outdated fast.

To address this issues, I work with something I call a behavioral code analysis. In a behavioral code analysis, you prioritize the code based on its relative importance and the likelihood that you will have to work with it and, hence, needs to understand that part. Behavioral code analysis is based on data from how the organization works with the code, and I use version-control data (e.g. Git) as the primary data source. More specifically, I look to identify hotspots. A hotspot is complicated code that the organization has to work with often. So it's a combination of static properties of the code (complexity, dependencies, abstraction levels, etc) and -- more important -- a temporal dimension like change frequency (how often do you need to modify the code?) and evolutionary trends.

I have found that identifying and visualizing hotspots speeds up my on-boarding time significantly as I can focus my learning on the parts of the code that are likely to be central to the solution. In addition, a hotspot visualization provides a mental map that makes it easier to mentally fit the codebase into our head.

There are a set of public examples and showcases based on the CodeScene tool here: https://codescene.io/showcase

I have an article that explains hotspots and behavioral code analysis in more depth here: https://empear.com/blog/prioritize-technical-debt/

I also have a book, Software Design X-Rays: Fix Technical Debt with Behavioral Code Analysis, that goes into more details and use cases that you might find useful for working with large codebases: https://pragprog.com/book/atevol/software-design-x-rays

2 comments

thecupisblue 2672 days ago

Nice! I'm working on internal tooling for us that does a lot of the same things - gonna buy the book, thanks for that, weird I've never heard about it. For now I'm measuring: churn, complexity, linting, test coverage, test quality and am going to add a dependency graph. It seems to me that churn, complexity and dependencies are the biggest indicators of a hotspot.

Got any tips for possible problems I'll encounter along the way?

link

nephrenka 2672 days ago

Cool - thanks! While the measures a simple in theory, there are some practical challenges; git repositories tend to be messy. So part of the practical challenge is to clean the input data (e.g. filter out auto-generated content, checked in third party libraries).

Another challenge is that version-control data is quite file centric, while many actionable insights are on a higher architectural level. In CodeScene we solve this by aggregating files into logical components that can then be presented and scored.

link

thecupisblue 2671 days ago

I'm already onto data cleanup, tbh I'm focusing on Android repos for now. But great idea, now that I think of it as an architect I'd mostly like the option to group a few classes/packages or let's call them "modules" together and then visually see which pieces are too dependent on outside sources and which are the hotspots/connections/dependencies inside that group.

There goes my weekend...

link

skillet-thief 2672 days ago

What is nice about this approach is that it gives you an instant map of the codebase (where the important parts are, etc.) that you really just can't get any other way.

link