by smithzvk 4999 days ago
So I'm relatively new to version control entirely, but in the last few years my group has been making a big push to institute Git. I have been wondering lately, however: how much history cleaning is expected/desirable?

When I develop, I split my work into as many small commits as I can so that each commit message covers a single topic. I thought that was basically the idea. Every once in a while I use rebase to combine a few commits that should have been done together because they all addressed the same issue. This all seems right to me. I am left with a clean history of everything I have done on a very fine-grained time scale. But the large number of commits, each with little significance to the whole program, hides the large-scale structure of the development.

However, I could use rebase to start combining loosely related commits, trading the time resolution for clarity in the commit history. There seems to be a continuum along this scale. Where is the proper place in that continuum to say this is clean enough? Also, I don't like making changes where I am losing perfectly good information.

I know that I can group certain commits by defining a branch, developing on it, then merging (non-fast-forward) back to the original. The branch should keep the grouping in the commit history. I even suppose that this can be done after the fact using rebase with the proper amount of git-fu. Are branching and non-fast-forward merges the preferred method of grouping related commits in the history?

If so, this seems troubling, as it means that partially fixing something is difficult to do with a clean history. Until the piece of the program you wish to fix is completely working, it shouldn't be merged into master because that would ruin the grouping of the related commits. This means that there can't be any partial thoughts, like fixing bugs as you find them, because presumably you might want to group all bug fixes of a function together but have a distinct commit for each.

Now I'm more confused than when I started. Seriously, any references or advice on this sort of topic are welcome.

3 comments

> However, I could use rebase to start combining loosely related commits, trading the time resolution for clarity in the commit history.

In general, your commits should be the smallest atomic operation that makes sense. When people talk about 'clean history,' they're talking about working in the awesome workflow git provides:

1. Write half-written broken code.
2. Fix that code up.
3. Add some more onto that.
4. Fix a typo!
5. Forgot to update the README.

Now, you could push that to master, but then the main master is littered with commit messages like 'oops' and 'typo.' Instead, you can rebase 5-1 onto the latest master, squash them together, and have one 'nice' commit that only has the cleaned up final changes.

This is one of the most powerful things about git: in a private repo, you can commit all kinds of garbage and half-written stuff without caring. When you want to make your stuff public, rebase and squash, then send it out. Be careful though! Only rebase your own private branches, or you're gonna have a bad time™.
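For anyone who wants to try the "commit garbage, then squash" flow without risking a real repo, here is a minimal sketch in a throwaway directory. It is one possible variant, not the only way: it marks the cleanup commits with `git commit --fixup` and lets `git rebase -i --autosquash` fold them in, with `GIT_SEQUENCE_EDITOR=true` accepting the generated todo list so nothing opens an editor. The file names, branch name, and commit messages are made up for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # use the branch name from this thread
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

echo "base" > README
git add README
git commit -q -m "Initial commit"

# Private branch: commit half-finished work freely.
git checkout -q -b feature
echo "half-written code" > feature.txt
git add feature.txt
git commit -q -m "Add feature"
target=$(git rev-parse HEAD)                 # the commit the later fixes belong to

echo "working code, typo fixed" > feature.txt
git commit -q -a --fixup="$target"           # recorded as "fixup! Add feature"
echo "documentation for the feature" >> README
git commit -q -a --fixup="$target"

# Fold the fixups into the commit they target before publishing.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash master

commits_on_branch=$(git rev-list --count master..feature)
git log --oneline master..feature            # a single "Add feature" commit remains
```

With hand-written messages like 'oops' and 'typo' instead of fixup commits, the same result comes from running `git rebase -i master` and changing `pick` to `squash` or `fixup` by hand.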

Okay, that is basically keeping with my current understanding (though I'm not sure how much I live up to the "only have working history in the public repo" rule).

There is the other issue I raised, however: is there a good way to group a series of commits that all work towards a single distinct goal? Using branches is a clear step in that direction, but it seems like a nightmare to perform a rebase like you described if the commits are mixed and I would like the end result to involve grouping via branches. That is confusing; hopefully this will clear it up:

1. Bugfix in function1.
2. Bugfix in function2.
3. New feature in function2.
4. Bugfix in function1.
5. Bugfix in function2.

...and we want in the end:

      /-- 1 ---- 4 ---\
  ---<                 >--HEAD
      \- 2 -- 3 -- 5 -/
Can rebase do this easily? Is this a good idea (it seems like it is to me)? The programmer would have to confirm that the code works at every state.

So I'm not sure if I understand correctly, but let me put it this way: with a little more git craziness, you can crack apart a commit and separate it into two. This is good if you did two unrelated changes to a file, committed that, and realized you wanted two separate commits later.

The basic process is:

1. git rebase -i, and change the commit's action from 'pick' to 'edit'.
2. git reset HEAD^. This 'undoes' the commit and leaves the changes in your working directory as if you had written the code but hadn't committed it yet.
3. git status
4. git add -p <filename>. This lets you stage the changes in a file one hunk at a time. First, stage all the hunks that belong in commit one; skip the ones you want for commit two.
5. git commit (do not do git commit -a here) and write the message for your first commit.
6. Now your working directory holds only the changes for commit two. git commit -a if you want all of them.
7. git rebase --continue
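The core of the steps above (reset, stage part, commit, commit the rest) can be sketched in a scratch repo. To keep the script non-interactive, this simplified version splits the most recent commit and stages whole files with plain `git add` instead of picking hunks with `git add -p`; for an older commit you would reach the same state via the 'edit' stop in `git rebase -i`. The file names and messages are invented for the example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

echo "function1, original" > function1.txt
echo "function2, original" > function2.txt
git add function1.txt function2.txt
git commit -q -m "Initial commit"

# One commit that mixes two unrelated changes.
echo "function1, bug fixed" > function1.txt
echo "function2, with a new feature" > function2.txt
git commit -q -a -m "Fix function1 and extend function2"

# Undo the commit but keep its changes in the working tree.
git reset -q HEAD^

# Commit one: stage only what belongs to the first topic.
git add function1.txt
git commit -q -m "Fix function1"

# Commit two: everything that is left.
git commit -q -a -m "Extend function2"

total_commits=$(git rev-list --count HEAD)
git log --oneline
```

The history ends up with two single-topic commits in place of the mixed one, and the final file contents are unchanged.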

This page[1] has a more concise answer, but leaves out the git add -p part.

Note that if you mess up in rebase-land, you can always git rebase --abort. If you come out of the rebase and everything looks lost ('oh god I lost my data!'), use git reflog and pull up the hash of where you were before. Your data is still there.
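A small sketch of the reflog rescue, in a throwaway repo. To keep it short, the "lost" work is simulated with `git reset --hard` instead of a botched rebase, but the recovery step is the same idea: find the old position in the reflog and move back to it. File names and messages are placeholders.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

echo "first version" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "second version, the one we care about" > notes.txt
git commit -q -a -m "Second commit"
before=$(git rev-parse HEAD)

# Simulate the "oh god I lost my data" moment.
git reset -q --hard HEAD~1

# The reflog still lists where HEAD was before the reset.
git reflog -n 3

# HEAD@{1} is the position HEAD had one move ago, i.e. "Second commit".
git reset -q --hard "HEAD@{1}"
after=$(git rev-parse HEAD)
```

In a real recovery you would read the hash you want from the `git reflog` output instead of assuming it is exactly one step back.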

Another note: if your commits are already separate, you can use rebase to selectively squash and reorder them. Read the manual on git rebase -i; if you rearrange commits and squash only some of them, I think you'll get what I'm talking about.

[1] http://stackoverflow.com/questions/6217156/how-to-break-a-pr...

Switching branches is cheap, so I'd say the "right" way to get a tree like you want is to have two or even five branches going all the time you're working. But I suspect you could make two branches and cherry-pick different sets of commits onto them to get the result you're after. To my mind it wouldn't be worth the effort though; how often do you really care whether the code worked with only 1 and 4 applied?
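For what it's worth, the cherry-pick idea can be sketched like this in a scratch repo. It assumes the two groups of commits touch different files, so every cherry-pick applies cleanly; with overlapping changes you would be resolving conflicts by hand. Branch names, file names, and the `commit_change` helper are made up for the example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # use the branch name from this thread
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

touch function1.txt function2.txt
git add function1.txt function2.txt
git commit -q -m "Initial commit"
base=$(git rev-parse HEAD)

# Hypothetical helper: append a line to a file, commit it, print the new hash.
commit_change() {
  echo "$2" >> "$1"
  git commit -q -a -m "$2"
  git rev-parse HEAD
}

# The interleaved history 1..5 from the question.
c1=$(commit_change function1.txt "Bugfix in function1")
c2=$(commit_change function2.txt "Bugfix in function2")
c3=$(commit_change function2.txt "New feature in function2")
c4=$(commit_change function1.txt "Second bugfix in function1")
c5=$(commit_change function2.txt "Second bugfix in function2")

# Rebuild each group on its own branch, starting from the common base.
git checkout -q -b fix-function1 "$base"
git cherry-pick "$c1" "$c4" > /dev/null
git checkout -q -b work-function2 "$base"
git cherry-pick "$c2" "$c3" "$c5" > /dev/null

# Join the two lines of work with an explicit merge commit.
git checkout -q -b grouped fix-function1
git merge -q --no-ff -m "Merge function2 work" work-function2

merge_count=$(git rev-list --merges --count "$base"..grouped)
git log --graph --oneline grouped
```

The `grouped` branch ends with the same files as the linear `master`, just with the fork-and-join shape from the diagram.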
Right, I would say that it isn't worth the effort. Also, I probably never care about the code with only 1 and 4 applied. So perhaps branches aren't the right way to do what I am describing.

I always saw VC as a systematic way to keep a log of my development so that I could figure out where I may have broken my code. For this purpose, having some sort of meta-data where commits can be grouped would be nice. It would also work to do something like always end my commit messages with some kind of meta-data tag that I could grep the log for. I was just wondering if there was a prescribed/built-in way for Git to handle this.
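The "tag at the end of the commit message" idea is easy to try. A rough sketch, assuming a made-up `Topic:` line as the tag (the name is just a convention invented here, nothing Git prescribes); `git log --grep` then filters on it. Empty commits keep the demo short.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

# Each commit carries a hypothetical "Topic:" line as its last paragraph.
# The second -m adds a separate paragraph to the message.
git commit -q --allow-empty -m "Fix off-by-one error" -m "Topic: function1"
git commit -q --allow-empty -m "Handle empty input"   -m "Topic: function2"
git commit -q --allow-empty -m "Fix missing nil check" -m "Topic: function1"

# List only the commits tagged for function1, newest first.
function1_commits=$(git log --format=%s --grep='^Topic: function1$')
echo "$function1_commits"
```

Nothing enforces the tag, so it only works as long as everyone writing commits sticks to the convention.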

git-bisect is the standard tool for figuring out where you broke something. I don't know what it does with branching histories though; I tend to effectively linearise my history by rebasing each branch on the trunk head before merging it.
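A minimal sketch of an automated bisect on a linear history, in a scratch repo. The "bug" is just a line containing BUG that appears in the fourth commit, and the check command is a stand-in for a real build or test script; `git bisect run` treats exit code 0 as good and 1 as bad. Tag names and the file name are invented.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

# Six commits; the fourth one introduces the "bug".
for i in 1 2 3 4 5 6; do
  if [ "$i" -eq 4 ]; then
    echo "BUG" >> app.txt
  else
    echo "line $i" >> app.txt
  fi
  git add app.txt
  git commit -q -m "Commit $i"
  git tag "c$i"                              # tags only make the demo easier to read
done

# Bad revision first, then a known good one.
git bisect start HEAD c1 > /dev/null

# The check exits 1 (bad) when the file contains BUG, 0 (good) otherwise.
git bisect run sh -c '! grep -q BUG app.txt' > /dev/null

first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset > /dev/null 2>&1
git log --format='%h %s' -n 1 "$first_bad"
```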
> I have been wondering lately, however: how much history cleaning is expected/desirable?

After you've published your work and someone else has checked it out, you don't want to touch your history unless there is a serious problem.

But when you're working on something, you can commit all you want, and do many commits. Then at some point you put your work up for review and get feedback. Then you address the feedback and commit as many times as you need to. When your code is good enough to be merged into master, you should clean up the history a little with rebase.

You should at least try to squash and rebase your commits so that there will not be any commit in the master history that is completely broken. The whole point of having a history is that you're able to go back. E.g. you might want to search for the point in history where a problem originated (git bisect can automate this with a "binary search"). You cannot effectively do that if your history is full of commits that do not work (e.g. won't build or fail every test).

To recap: never change published history unless there is a serious issue (like you committed your database password to GitHub). But you can and should change your local history before you publish to master so that there are no broken commits that make it difficult to walk back in history.

My workflow when working on a large project or doing multiple commits looks roughly like this:

  git checkout -b featurebranch
  git commit -am "foo"
  git commit -am "bar"
  git rebase master # to update my personal history with public history
  git commit -am "baz"

I've used different flavors of merging it back in, though. Method 1 is to `git checkout master; git diff master..featurebranch | git apply`. Method 2 is `git rebase -i HEAD~10; git checkout master; git cherry-pick featurebranch`. I'm sure there are other and better methods, but those are the ones I've used recently that I like.

After I collapse a branch down into a single commit (I rarely want a branch to become multiple commits), I typically use `git commit --amend` to change the commit message to something fitting and push it upstream. --reset-author is also good there so that the commit carries the current date/time rather than that of the first commit you squashed.
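One of the "other methods" for collapsing a branch into a single commit is `git merge --squash`, sketched here in a scratch repo. This is not one of the two methods described above, just an alternative with the same end state: it stages the combined changes of the branch on master and leaves the commit to you. The branch name follows the workflow above; file contents and messages are placeholders.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # use the branch name from this thread
git config user.name "Example Dev"           # throwaway identity for the demo
git config user.email "dev@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

git checkout -q -b featurebranch
echo "foo" >> app.txt
git commit -q -a -m "foo"
echo "bar" >> app.txt
git commit -q -a -m "bar"

# Stage everything featurebranch changed, without creating a merge commit.
git checkout -q master
git merge -q --squash featurebranch > /dev/null
git commit -q -m "Add foo and bar"

master_commits=$(git rev-list --count master)
git log --oneline master
```

Because the commit is created fresh on master, it gets a new author date on its own, and the message is written once at commit time.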