Hacker News
Refactoring Large Codebases
2 points by _ot2g 1401 days ago
Over my career I've refactored a handful of gigantic codebases and I have a few tips.

First: Version control _everything_. I mean it: remove `.gitignore` and add everything to the git index before you start refactoring. The worst thing that can happen is that you accidentally make a change, everything breaks, and you have no way to get back to a working state.

Second: If you don't already have them, write compatibility tests for all public interfaces. It is easy to fall into the trap of refactoring code and running a simple program to check that it works, but in a large application, chances are that your simple program does not cover all the edge cases.
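A minimal sketch of one way to write such a test in Python: record what a public function returns today, then compare against that record after each change. `legacy_price` and `record_or_compare` are hypothetical names made up for illustration, not part of any real codebase or library.

```python
import json
from pathlib import Path


def legacy_price(quantity, unit_price):
    """Hypothetical stand-in for a public function in the code being refactored."""
    total = quantity * unit_price
    if quantity >= 10:
        total *= 0.9  # bulk discount: the kind of edge case a quick manual run misses
    return round(total, 2)


def record_or_compare(func, cases, snapshot_path):
    """Record func's outputs on the first run; afterwards return the cases that changed.

    Each case is a tuple of positional arguments. Arguments and results are
    assumed to be JSON-serializable.
    """
    snapshot_path = Path(snapshot_path)
    actual = {json.dumps(args): func(*args) for args in cases}
    if not snapshot_path.exists():
        # First run: the current behaviour becomes the reference.
        snapshot_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return []
    expected = json.loads(snapshot_path.read_text())
    return [key for key in actual if expected.get(key) != actual[key]]
```

Run it once before touching anything, commit the snapshot file, and treat a non-empty return value as a failed test. The cases should include the boundaries (here, a quantity of exactly 10), since those are what the "simple program" tends to skip.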

Third: Use code coverage and eliminate all dead code paths before you start refactoring. When refactoring, chances are that you don't want to port everything: backwards compatibility, patches, etc. Code coverage reports let you see exactly what is and isn't exercised by your tests on your target platforms.
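To illustrate the idea only: a real project would use a dedicated coverage tool, but the principle can be sketched with Python's built-in `sys.settrace` hook. `called_functions`, `used`, `helper`, and `unused` are hypothetical names for this example, and matching functions by bare name is a simplification.

```python
import sys


def called_functions(entry_point, *args):
    """Run entry_point and return the names of all Python functions it called."""
    seen = set()

    def tracer(frame, event, arg):
        if event == "call":
            # Crude: functions are identified by name only, so two functions
            # with the same name in different modules are not told apart.
            seen.add(frame.f_code.co_name)
        return None  # no line-level tracing needed

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        entry_point(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return seen


# Hypothetical code under test.
def helper():
    return 1


def used():
    return helper()


def unused():
    return 2


defined = {"helper", "used", "unused"}
dead_candidates = defined - called_functions(used)
```

Anything left in `dead_candidates` was never reached by the run. That makes it a candidate for removal, not proof that it is dead: it only shows what this particular entry point exercised.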

Fourth: Make a list. I start refactors by scanning the codebase without making any changes and just writing a list of things that I would like to change. Planning helps you understand the scope and lets you feel that you are making progress in what may be a long endeavor.

Fifth: Make one change at a time. That's where the list comes into play. It is easy to start just going file by file and making changes to everything, but you are not setting yourself up for success, esp. if there are thousands of files. Tackling one problem at a time keeps you focused and efficient.

Sixth: Document breaking changes and design decisions as you refactor. Whatever you document becomes the guiding principles for yourself and serves as a memo for future maintainers of the codebase.

These are the things/ideas I follow when refactoring large codebases. What about you?

1 comments

Re: First

What code wouldn't already be in version control?

"I’ve seen things you people wouldn’t believe. Codebases on fire, edited in production. I watched websites uploaded from development through an FTP gate. All those files will be lost in time, like tears in rain. Time to die()."

More seriously, the volume of development happening without version control is astounding. FTP-ing an entire codebase to a production directory, with the files named ".bak.1", ".bak.2", etc. is still a common thing.

(For what it's worth I wrote a book on Modernizing Legacy Applications in PHP https://leanpub.com/mlaphp so this subject is close to my heart.)

The point was to add _everything_ to version control. Dependencies, artifacts, outputs, etc. The last project I was refactoring had a git repository of about 2 GB because of this, but it saved me a ton of headaches. I was refactoring a system that I wasn't deeply familiar with. Having everything in version control let me track how modifying the system code changed the outputs, and if upgrading dependencies broke the program, I could view the diff to understand how the dependencies had changed.
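With everything committed, `git diff` already answers "what changed in the outputs?". The same check can be sketched in Python for cases where you want it inside a script. This is a sketch under assumptions: `fingerprint_tree` and `diff_fingerprints` are hypothetical helper names, and the output directory is whatever your build writes to.

```python
import hashlib
from pathlib import Path


def fingerprint_tree(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_fingerprints(before, after):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return added, removed, changed
```

Take a fingerprint of the output directory before a change, rerun the program, take another, and compare. Three empty lists mean the refactor did not alter any output file.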