by gmfawcett 1938 days ago
Quite a few people are suggesting that, when it's time to share your code with others, maybe you should squash/rebase it to clean things up. That's totally up to you... but just know that not everyone thinks rebasing is a good idea. See [1], for example.

[1] https://fossil-scm.org/home/doc/trunk/www/rebaseharm.md

I think we often feel the urge to rebase and squash not because it actually makes our code changes easier to understand, but because it makes us feel better about ourselves. That's a red flag. Understanding how you got to the goal -- encoding all the fumbles and disoriented thoughts right in the commit history -- that can be a genuine benefit to the reader. Who do we really help by pretending that we're more organized, coherent, and linear than we actually were?

8 comments

> Who do we really help by pretending that we're more organized, coherent, and linear than we actually were?

We're helping the future reader who's reading the history because they want to understand why a change was made - and "because the author of the branch initially had the wrong idea" is almost never the answer they're looking for.

I sometimes enjoy reading stream-of-consciousness writing, but most of the time (especially when reading code) I'm more interested in the point itself. The same applies to version history. It can be used to tell the raw story, but there's usually a more useful and interesting story to be told.

Exactly. I want to "tell a story" with my commits, and that story is really more of an idealized retelling of what I actually did.

Five years from now, no one needs to know that I forgot to add that one line to a prior commit and had to add it separately, or that my first attempt didn't quite pan out as expected.

What that future person _will_ care about is:

- What final changes actually got made?

- What task was I working on?

- What was the reason for any of these changes in the first place?

- Why did I make some of these changes specifically to implement that task?

- What additional side info is important context for understanding the diffs?

Exactly. It's also great to compartmentalize different aspects of your change.

Often my changes are

1. Refactor the existing code to support the new feature

2. Add the new feature

It's great to keep these separate, because someone can look at number 1 and see that the two versions of the code ought to be functionally the same (same tests pass, app looks the same, refactor is easy-to-understand), and look at number 2 and see the new feature.

There are countless other times where you want to tell the "story" in a logical fashion.

(Honestly, I expect that there is a significant correlation between being a good git committer and being a clear story-teller.)

I understand that you want to tell a story. But as someone examining your code, I also want to know how you got there. While you're throwing out your junk, you're also throwing away valuable information. If I'm taking the actual time to review the code history, then let me play it out in real-time, mistakes and all. I know how to step back and summarize; I don't need you to do that for me.

This is especially true if your code is clever. I'm much more likely to understand your polished gem if I can see all the things that you bumped into while you were discovering it.

That is what comments and commit messages are for. I trawl history all the time. Running into an unbisectable mess of a branch (because a bug that was introduced in commit X~15 is fixed in X on the same branch) is a complete nightmare. I have to dissect the branch history and understand what is because of the branch and what is debugging/review/CI cycle cleanups. Commit messages for fixups also tend to be 100% terrible and utter trash. "Fix review comments". Thanks. If we're doing that, let's copy what the comment was in too and why it fixes it.
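One possible way to avoid the "introduced in X~15, fixed in X" shape is to fold the fix back into the commit that introduced the problem before the branch is merged. A minimal sketch, assuming a throwaway repository; the file names and commit messages are made up for illustration, and `GIT_SEQUENCE_EDITOR=true` just accepts the todo list that `--autosquash` prepares:

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "base" > lib.txt
git add lib.txt
git commit -q -m "Initial commit"

# Hypothetical commit that contains a mistake.
echo "helper with a bug" > helper.txt
git add helper.txt
git commit -q -m "Add helper"
target=$(git rev-parse HEAD)

echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature using helper"

# Fix the mistake later and record it as a fixup of the original commit.
echo "helper" > helper.txt
git commit -q -a --fixup="$target"

# Replay the branch; --autosquash moves the fixup next to "Add helper" and folds it in.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash "$target^"

git log --format=%s
```

After the rebase the "fixup!" commit is gone and "Add helper" carries the corrected content, so every commit on the branch can be tested on its own.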

The problem with your request is that 90+% of the time (with the way I develop), the dead ends are on MRs that got closed or code that never got pushed in the first place. So again, comments as to why this approach is used are way better than hiding it in the history, because someone coming to "clean up" code sees the thought process instead of having to remember to search for it.

I don't do much work like that -- I suspect you're part of a much larger developer team -- but I think I understand the problem you're describing.

Couldn't you simply review/bisect at the fork/join points? i.e., take the commits at which forks began or ended, ignore any intermediary commits, and run the bisect (or, read diffs) across that subset? That way you're only comparing at the chapter-markers of the story, so to speak, and not getting mired in the gory details.

Yes, `git bisect --first-parent` was a feature I wanted for a long time. It finally exists now, so yes that helps, but is not a complete solution.
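For anyone who hasn't tried it, here is a rough sketch of what that looks like, assuming a Git new enough to have the option. The repository, the `status.txt` file and the `grep` check are all invented for the demo; a real run would use your own test command:

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # pin the trunk name regardless of local defaults
git config user.name "Example"
git config user.email "example@example.com"

echo "ok" > status.txt
git add status.txt
git commit -q -m "Initial commit"
good=$(git rev-parse HEAD)

# A messy topic branch: both commits break the hypothetical check.
git checkout -q -b topic
echo "broken" > status.txt
git commit -q -am "wip: try new approach"
echo "bug" > status.txt
git commit -q -am "Finish feature"

git checkout -q main
git merge -q --no-ff -m "Merge topic" topic
echo "more" > other.txt
git add other.txt
git commit -q -m "Unrelated change"

# Bisect only along the trunk's first-parent chain, skipping the topic's inner commits.
git bisect start --first-parent HEAD "$good" >/dev/null
git bisect run sh -c 'grep -q ok status.txt' >/dev/null
culprit=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

git log -1 --format=%s "$culprit"
```

The search lands on the merge commit rather than on one of the intermediate commits inside the topic branch, which is the "chapter marker" behavior asked about above.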

Even with `bisect --first-parent`, I still want useful commit messages, and "fixup" commits are, on the whole, uniquely terrible at providing them.

I do software process and other things, so some of my branches tend to be gigantic (e.g., revamping the build system) and can be 200+ commits because one cannot meaningfully land a build system rewrite incrementally. That one in particular was meant to be bisectable because when rebasing on top of new development, I wanted fixes to be in the "port this library over" commit instead of after some random merge commit based on when I decided to sync up that week (it took a year to do it). So once I get it down to a particular MR, being able to inspect that topic is still a useful property.

Note that this only works with a `merge --no-ff` workflow too. The `rebase && merge --ff-only` pattern and `merge --squash` are both terrible, IME, at making useful history. The force-rebase workflow is just as confounding to me as the no-rebase workflow (the former de-parallelizes your MR merge process and the latter tends to make a terrible commit history).
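A small sketch of why `--no-ff` matters here, using a made-up repository and commit messages: the merge commit gives the trunk one entry per topic, while the topic's own commits stay reachable for anyone who wants to drill down.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # pin the trunk name regardless of local defaults
git config user.name "Example"
git config user.email "example@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

git checkout -q -b topic
echo "helper" >> app.txt
git commit -q -am "Add helper"
echo "use helper" >> app.txt
git commit -q -am "Use helper in app"

git checkout -q main
# --no-ff forces a merge commit even though a fast-forward would be possible.
git merge -q --no-ff -m "Merge topic: helper-based feature" topic

echo "Trunk view (one entry per merged topic):"
git log --first-parent --format='  %s'
echo "Full view (topic commits still there):"
git log --format='  %s'
```

With `merge --ff-only` or `merge --squash` there is no merge commit tying the topic's commits to the trunk, so the first-parent view and the full view can no longer tell these two stories apart.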

Note that even for single-developer projects I run, I tend to make PRs even for my own changes (once it's gotten off the ground).

While I understand and somewhat empathize with this desire (I'd use it all the time for personal repos, for example)... current VCS systems are terrible at supporting it.

What you probably want in this case is something like "automatically commit on every change (possibly recording every keystroke)" + "automatically tag based on tests/builds passing or failing" + "allow manual comments at any time, whether based on files changing or not". All of that is technically possible with git/hg/fossil/etc, but it's so much work for both the recorder and the viewer that it's infeasible.

This is great, except that we’re often bad at recounting this idealized history without lying in ways that make later maintenance more difficult.

> or that my first attempt didn't quite pan out as expected.

Actually that's still important, it's just important from an architecture perspective.

As a much newer developer, the biggest problem I have with git is that I rarely end up actually making one change at a time. I'll be working on some larger thing, and in the process I'll notice and quickly fix a smaller thing before returning to the original task. This might be a typo in a code comment, a poorly named variable, or a block of code I realize is dead.

I suspect this is the type of tendency which goes away with experience, but it makes git a lot less useful. My commits won't really tell you what changed; the most they can tell you is the primary change I was working on.

Many of us do that, and it's not just a new developer thing. Git actually enables this, because you get to pick and choose what to add to the index (`git add`) before committing. So that little tweak you made in the unrelated function? -- no problem, just `git add` that later, and commit it under a different message. Not all SCM tools give you that kind of flexibility.

On the other hand, there's a diminishing return to placing every tiny change into a separate commit. Commit messages like "Fixed multiple small things" might make some people clutch their pearls, but sometimes you just need to get shit done and move on to solving bigger problems.

My suggestion is to consider breaking your commit into two: one for "fixed this big issue that everyone cares about", and one for "a bunch of tiny cleanup stuff that I happened to notice." (Maybe call that second one "refactoring" -- it will go over better with your audience.)

> Git actually enables this, because you get to pick and choose what to add to the index (`git add`) before committing.

That assumes the changes are in separate files though, right? I know you can use the "-i" flag, but it's fairly labor intensive.

That kind of depends on your tooling. e.g., I use Magit (an Emacs front-end for git) which makes interactive mode really, really easy.

(But easy or not, other version control systems such as Subversion don't offer the feature at all. We kind of take Git for granted these days, but it wasn't always like that.)

A lighter weight option is the --patch flag to 'git add' and 'git commit'.

Personally I have gotten used to using `git commit --patch` for everything (even if I only have one change) just as a convenient way of reviewing the changes I am about to commit. With that, only committing part of the changes is no additional effort.

Look at `git add -i`. You can commit just part of a change to a file. So if you notice a small problem and already have a bunch of changes made, you can still make those changes, and commit them separately.

Up to you if you wanted to rebase those changes back onto main.

I don't use it often and find it's kind of painful to use, but if you're in the position where you've already saved two different things in your IDE and need to pull them apart for commit, it's a useful tool.

Have you tried using 'git commit --patch'? It makes it easy to separate out unrelated changes when committing. You can precede it with an invocation of 'git reset $HASH' to restructure your last few commits.

In general, more experienced git users aren't actually working on one commit at a time. They're just comfortable enough with editing history to make it look that way.
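A rough sketch of that "edit it afterwards" workflow, assuming unpushed work in a throwaway repository. File names and messages are invented, and it stages whole files rather than using `--patch` so that it runs without prompts:

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "readme" > README.txt
git add README.txt
git commit -q -m "Initial commit"
base=$(git rev-parse HEAD)

# Work as it actually happened: two unrelated things mixed across "wip" commits.
echo "feature part 1" > feature.txt
echo "typo fix" > notes.txt
git add feature.txt notes.txt
git commit -q -m "wip"
echo "feature part 2" >> feature.txt
git commit -q -am "more wip"

# A mixed reset drops the commits but keeps every change in the working tree.
git reset -q "$base"

# Re-commit the same changes as two self-contained commits.
git add notes.txt
git commit -q -m "Fix typo in notes"
git add feature.txt
git commit -q -m "Add feature"

git log --format=%s
```

The final content is identical to what the "wip" commits produced; only the grouping and the messages changed.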

That is what OP was talking about: after you've done the change, make a commit with just that tiny refactoring. Once you're done and ready to review your work, you can cherry pick just that fix and move it to main / master / its own PR. Since it is self-contained, it can be processed by itself only.

`git add -p` will take you through all the changes in your files, and let you add them selectively. I find this makes for much cleaner commits.

With the add command's interactive mode, it is often possible to selectively stage and commit individual patches in a file.

https://git-scm.com/book/en/v2/Git-Tools-Interactive-Staging

> We're helping the future reader who's reading the history because they want to understand why a change was made...

"Change" is subject to interpretation. Most of the time, it's the scope that the change belongs to that carries the meaningful value.

Say, changes made in connection to fixing an issue are logically tied for inclusion as well as for potential unwinding.

Some tangent changes technically should not be casually folded in, just in case this changeset will need to be propagated or rolled back.

Thus this elaborate multi-staged commit management in Git.

Many projects don't have such a need to manage the change flow, so version control is used as a kind of undo buffer. Which is fine; in such cases the meaning is tied to release states.

If anything, it makes more practical sense to preserve only commits with a buildable state, not just some transitional changes.

The advantage is that you get a more usable and understandable list of historical changes. "You wouldn't publish the first draft of a book" [1]

A squashed merge or rebased and cleaned set of commits gives a very clean overview of which changes were made, at what point, why they were made, and what together. That picture tends to get utterly lost in the "set up X", "make test Y", "fix typo", "wip" and "change error handling" commits a feature branch typically has.

Additionally, I'm not really interested in the fact that my colleague started change X yesterday before lunch; I'm interested in when it went live and became visible to all the developers, which is when it was merged into the main branch.

[1] https://git-scm.com/book/en/v2/Git-Branching-Rebasing#_rebas...

You wouldn't publish a first draft, but neither would you burn it once the final draft was off to the printer. Personally, I'd prefer it if "squashing" commits was purely a UI thing; the underlying commits were all still there, but grouped together and displayed as a single big "virtual" commit. That way you could still drill down to the real history if you needed to.

Why would you want to see every typo that was corrected? Every little test that was changed erroneously and then backed out again?

That may be an accurate representation of the order savepoints were made, but it's not an accurate representation of how the software evolved. It is noise that needs to be discarded if a reader would like to know what change was really made. It also makes it difficult or impossible to use tools like git bisect.

Is the argument really that a more detailed history is always better? In the trivial case every keypress could be a savepoint, and every savepoint a commit.

One does not always know in advance that a commit needs to be split in two. The only way to produce readable commits without rebasing them in that case is to work with local _backup files. A version control system does this much better.

In fairness, you're only seeing 5% of the typos. We caught the other 95% before committing. :)

I love your question, "why not a commit per keypress?", because it raises an interesting follow-up: why not squash and rebase entire months or years of project work into single commits? If squashing is so useful, why do we only apply it at low-grain scales? Could we read and understand massive projects quickly and easily, if they only had a few commits to them?

I'm sure that we don't experiment with larger-scale rebases because of the limitations in the technology -- we all know that we're not supposed to 'git rebase' in public, and why that is. But suppose those obstacles were lifted. Now that we can rebase and rewrite at any time scale, which scale(s) is the right one(s) to choose?

> why not squash and rebase entire months or years of project work into single commits?

The argument here is that one should rebase and carefully craft commits that isolate each functional change into a separate commit, where each change is motivated and builds on the previous one, before pushing anything. Every commit should build cleanly, preferably even pass tests. That makes changes easier to reason about, and enables the use of tools such as bisect. Look at git itself for an example of this type of history.

The counter argument to that was that it presents a false view of history. Maybe there were false starts and mistakes made along the way. If these are not preserved in the history, the reader is left without an understanding of them. This is not an uncommon argument. Some people argue rebase should never be used.

This view suggests that a more detailed history is preferable. Taken to its logical extreme, that would mean every keypress and editor command.

But "why not delete all of history" is not an example of "carefully crafted commits" taken to an extreme. Quite the opposite.

Basically, you want to keep the history of individual logical patches to the codebase, but not the meta-history of how those patches were made.

> it raises an interesting follow-up: why not squash and rebase entire months or years of project work into single commits?

That's effectively what happened before version control/before the small-scale rebases we enjoy now were possible. And the reason is that it's hugely valuable in certain circumstances to be able to see some granularity of the history. (Though clearly people disagree about what the grain size should be.)

> Could we read and understand massive projects quickly and easily, if they only had a few commits to them?

I don't think so. The current state is visible at the top of the git tree regardless. History comes in when you are trying to understand why the state is what it is. Usually this is for troubleshooting in my experience, but sometimes also when doing a refactor. Meaningful commit messages attached to meaningfully-clumped patches are, in my opinion, absolute gold in those cases.

There's little benefit to squashing down a year's worth of work into 5 commits because you can just as easily tag each of those 5 commits with a version number, give it a little write up, and call it a release.

I think the reason to squash commits is to cut out the noisy bits that were only useful to the original developer that day and create a timeline that's helpful for future readers. It doesn't really make sense to get more granular than the level of a single commit with a good comment and a small set of cohesive changes. So you store your history at that granular level and you can take care of the rest with tags, minor and major versions, etc.
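As a small illustration of handling the coarser levels with tags instead of bigger commits, here is a sketch using annotated tags. The version numbers, file and messages are invented for the example:

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "login" > app.txt
git add app.txt
git commit -q -m "Add login flow"
# An annotated tag carries its own message, i.e. the "little write up".
git tag -a v1.0.0 -m "First release: login flow"

echo "password reset" >> app.txt
git commit -q -am "Add password reset"
git tag -a v1.1.0 -m "Adds password reset"

git tag -n1     # list tags with the first line of each annotation
git describe    # name the current commit relative to the nearest annotated tag
```

The commits stay at their small, cohesive size, and the tags provide the release-level view on top of them.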

The Fossil designer agrees with you:

"So, another way of thinking about rebase is that it is a kind of merge that intentionally forgets some details in order to not overwhelm the weak history display mechanisms available in Git. Wouldn't it be better, less error-prone, and easier on users to enhance the history display mechanisms in Git so that rebasing for a clean, linear history became unnecessary?"

I'm not a user of it myself, but I believe this is the philosophy behind how Fossil approaches it:

https://fossil-scm.org/home/doc/trunk/www/rebaseharm.md

Pull requests can serve the same purpose; messy feature branches and a clean main trunk.

The only way you get that in Git is if you squash-and-rebase before merge, though. Which is fine if that's the process and end result that you want, but does (if you keep feature branches "messy") disconnect feature branches from their related merges into trunk from Git's point of view.

Yeah, you're reliant on GitHub metadata to make those links for you; there's nothing natively in git itself doing it. It's also an all-or-nothing affair, where the whole PR becomes a single squashed commit. To get anything in between ("here's my single large PR which I've rebased into N incremental commits, but you can also dig in and see the work that actually led here"), you really do need first class support in the tool.

I suppose the GitHub answer to all this would be "just make separate PRs", but going that way asks a lot more of the developer in terms of how polished those incremental states need to be.

Mercurial does this with the Evolve extension.

https://www.mercurial-scm.org/doc/evolution/user-guide.html#...

It still has the individual commits, but the interface will make it appear as if it's just one commit.

The real history is useless. Especially if we have tests. In that case it doesn’t matter how often we make changes.

I do think this is because I prefer to think of code as a black box. No one should need to figure out how my functions work. Someone should just need the name of the function, what inputs it receives, and what output it returns. If someone actually has to read my code, that’s a failure.

> If someone actually has to read my code, that’s a failure.

I can't tell if you're being serious, or are a brilliant troll. :)

Assuming you're serious, Hyrum's Law is one reason I might need to see your code (https://www.hyrumslaw.com/). The signature of your function is not the whole signature, it's just a sketch of the high points.

You should really only need to read the code when something goes wrong, but otherwise, no. You need to be more careful with your time.

> Who do we really help by pretending that we're more organized, coherent, and linear than we actually were?

You help the reviewer.

To understand why git is the way it is, you have to understand the workflow of the original git-using project (other than git itself), the Linux kernel. Whenever someone proposes a change to the Linux kernel, it's sent as a sequence of patches. Each patch should contain a single logical change, and will be reviewed individually. For instance, suppose you want to change the way a field in a particular structure is stored. The first patch of your series might introduce a couple of helper functions to access the structure fields. Patches 2-5 might each change a separate subsystem to use the new helper functions, instead of accessing the field directly. The next patch changes both the field and the helper functions to use the new representation. When reviewing this sequence, it's easier to see that each patch is correct. And that was a simple example; it's not rare to have patch series with over 15 patches, and even longer patch series are not unheard of. I've seen patch series which refactor whole subsystems, where each patch in the series was an obviously correct transformation, while the final result was completely different.

From the Fossil page:

> Rebasing is lying about the project history

This tired hyperbole just won’t seem to ever go away. Please try to ignore this junk, the Fossil devs could and should make their point without the FUD and misleading judgement, if they want to be taken seriously. Rebase has perfectly legitimate uses, and if Fossil makes it so you don’t need to rebase, that’s fantastic.

Rebase is most useful before pushing local changes to other people, and most people fluent in git know this fact, and also know that you don’t rebase public branches, you don’t rebase other people’s commits or your own after they’re pushed, except in emergencies and with team communication.

Rebasing before you push is the same amount of “lying” as typing something into your editor and then deleting it before you hit save. You don’t actually want your history at the raw keystroke level, right? You aren’t “lying” if you fix a bug you wrote before you push the bug into public branches, right?

> Understanding how you got to the goal -- encoding all the fumbles and disoriented thoughts right in the commit history -- that can be a genuine benefit to the reader.

Disagree.

Sorry, but I'd rather read commit history like this (whether it's reviewing others' code or my own at a later time):

- Add functionality X to function y()

- Fix a bug in y(): ...

- Fix a bug in z(): ...

than

- X

- oops

- fuck, typo fix

- do it another way

- ok, y is fixed now

- another typo fix

- it has a bug, fix it

- z has the same bug

- typo fix

The latter can be quite common during the dev cycle, but it's something to keep to yourself. It's not about 'pretending' at all.

I think that's a pretty valid argument about just wanting to rewrite history.

I'll offer an alternative. I love having every commit buildable. When I'm drafting, this isn't going to happen. I'd like to save my work and move between machines more frequently than that. But after a rebase, it's great to only have compiling commits. It makes doing a bisect a lot easier when you're hunting for something.
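One way to check the "every commit builds" property after a cleanup is `git rebase --exec`, which runs a command after each replayed commit and stops at the first failure. A minimal sketch, where `build.sh` is a stand-in for a real build or test command and the repository is a throwaway:

```shell
set -e
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
log=$(mktemp)       # build log kept outside the repository

# Stand-in for a real build: "succeeds" only if main.txt exists.
printf 'test -f main.txt\n' > build.sh
echo "v1" > main.txt
git add build.sh main.txt
git commit -q -m "Initial commit"
base=$(git rev-parse HEAD)

echo "v2" > main.txt
git commit -q -am "Refactor main"
echo "v3" > main.txt
git commit -q -am "Add feature"

# Replay every commit after $base and run the build after each one.
git rebase -q --exec "sh build.sh && echo built >> '$log'" "$base"

cat "$log"
```

If the command fails on some commit, the rebase pauses there so the commit can be amended before continuing with `git rebase --continue`.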

I have found this works a charm, if I want to present a clean repo (for things like tutorials and classes): https://24ways.org/2013/keeping-parts-of-your-codebase-priva...

But basically, I let things "all hang out."

Tools shouldn't really be running the show.

My commit history is often a descent into profane madness.