by mikeash 4574 days ago
I'm not discounting it, I simply don't agree with how git implements it.

IMO the correct option is to create a new repository that has the same history as the old repository minus the offending commit (or possibly with an edited version of that commit that leaves out the offending string).

Because it creates a new repository, there's no risk of data loss in your old repository. Once you're confident that the operation succeeded, you can swap them.

I haven't had to do this for a long time, but as I recall, this is basically how svn does it. It works fine.

The problem with git is that it makes this far too easy and it works by editing existing repositories rather than creating new ones. So instead of once-in-a-blue-moon repository hacking to get rid of that password you accidentally committed, you get people rewriting history because they think the real history isn't "clean". I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.

Finally, I'm confused about something, so maybe you could clear this up for me. I keep seeing assurances that 1) git does not actually destroy any data, and you can always recover if you screw up and 2) editing history is sometimes a vital necessity for cases like when you commit passwords. You yourself made these assurances in this comment. However, 1 and 2 are obviously mutually exclusive. If you can always recover then you can't actually scrub the repository of accidentally committed passwords and the like. Which one is actually true?


Re: 1 and 2

1) This is almost true. Anything that is committed to Git is recoverable. When you "re-write" history, Git is creating a new set of commits in the history, an "alternate history path." It does not destroy the original commits, but there is no named reference to them (unless you created a branch/tag pointing to this line of commits).

2) In this case, if you want to actually destroy these unreferenced commits, you must run "git gc". This IS a destructive command. It will remove any unreferenced commits from the repository (gc = garbage collect). If you never garbage collect, you will always have access to anything that was ever committed. It just might be hard to find, since the only reference is the reflog (if it was recent) or the commit hash.

Since garbage collection does happen automatically after a while, it seems that the "doesn't destroy data" bit isn't completely true. But I understand that it's a fairly rare case where you're going to screw something up and then not bother to get it back until after garbage collection cleans it up.

Thanks for clarifying that.

>I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.

This is no more insane than editing a source code file before you save it to the file system. Git is used as a development tool as well as version control, and developers are therefore encouraged to commit often, even if the code does not actually compile yet. There is no more need to fill the published history with all of these WIP commits than there is for me to know about every goddamn keystroke you made while you were dicking around with that config file.

Is the history stored as a text file somewhere that you can just edit? I sometimes wish git were a bit more transparent and less of a black box.
I suggest that you pick up any git tutorial out there. It will soon become less of a black box.
I've read a lot about git. The docs generally don't pick apart what's inside the .git directory.
- The history items are stored as commit objects that are identified as a SHA-1 sum of the contents (including meta-data like Authored By, Committed By, etc).

- One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).

- All of the commit objects are stored under .git/objects.

- Branches are just files under .git/refs/ that contain the SHA-1 sum of the most recent commit on that branch. This is why they are called 'branch pointers.' That's basically all they are.

- If you have a history of 5 commits, and make a change to the initial commit, you now have 10 commits in your .git/ directory. Your (e.g.) 'master' branch will point to the most recent 'tree' of 5 commits. The other commits will still exist in .git/objects, but there will be no branches pointing to them. You can use 'git reflog' to find them, or access them by their SHA-1 sum.

- Eventually 'git gc' (gc = garbage collect) will clean out the unreferenced commits, but this happens rarely if you don't explicitly run the command.

- When you 'git push,' you are only pushing branches to the remote repo, so commits that are stored locally, which are not referenced by one of those branches you are pushing, will not be pushed out. If you have commits that you don't want to end up in limbo like this, you should 'git tag' them or create a branch (e.g. 'archive/master-2013-12') that points to them.
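
To make the list above concrete, here is a rough Python sketch of the idea, not Git's actual code. It builds simplified commit objects, names each one by the SHA-1 of a "commit <size>" header plus its body, and then "edits" the first of 5 commits. The author line, the empty tree, and the `store`/`refs` dictionaries are made up for illustration, and the pruning step ignores the reflog and the grace period that real garbage collection applies.

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """Name an object by the SHA-1 of '<kind> <size>', a NUL byte, and the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def commit(store, tree, parent, message):
    """Build a simplified commit object, record it in store, return its id."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")  # the parent id is part of the hashed body
    # Hypothetical fixed identity and timestamp, so only parent/message vary.
    lines.append("author A U Thor <author@example.com> 0 +0000")
    lines.append("committer A U Thor <author@example.com> 0 +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    oid = object_id("commit", body)
    store[oid] = parent  # toy object store: commit id -> parent id
    return oid

def build_chain(store, messages):
    """Create one commit per message, each pointing at the previous one."""
    tree = object_id("tree", b"")  # every commit reuses the same empty tree here
    parent, chain = None, []
    for message in messages:
        parent = commit(store, tree, parent, message)
        chain.append(parent)
    return chain

def reachable(store, refs):
    """Walk parent pointers from every branch tip."""
    seen = set()
    for tip in refs.values():
        while tip is not None and tip not in seen:
            seen.add(tip)
            tip = store[tip]
    return seen

store = {}
messages = [f"commit {i}" for i in range(1, 6)]
old = build_chain(store, messages)
refs = {"master": old[-1]}            # a branch is just a name -> commit id

messages[0] = "commit 1 (edited)"     # change only the initial commit
new = build_chain(store, messages)
refs["master"] = new[-1]              # move the branch pointer to the new chain

objects_after_rewrite = len(store)    # old and new chains both exist
unreferenced = set(store) - reachable(store, refs)

for oid in unreferenced:              # toy 'gc': drop what no branch can reach
    del store[oid]

print(objects_after_rewrite, len(unreferenced), len(store))
```

Commits 2 through 5 keep their messages in the rewritten chain, yet all of them get new ids, because each one hashes its parent's id.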

It looks like .git/logs contains the history. It looks like the file format is a space-separated list, with the format "$parentcommitsha1 $newcommitsha1 ... $commitmessage". That's fairly comprehensible. What are the SHA-1 sums of? Are they of the entire snapshot, or the delta? I went into objects/ and ran `sha1sum $objfile`, and the sum did not match the file name. So that remains obscure. `file $objfile` could not identify the format; it gave nonsense.

Thanks for the help.
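
On the `sha1sum` mismatch: as far as I know, a loose object file is zlib-compressed, and the name is the SHA-1 of the uncompressed data, which is a short header ("blob <size>" plus a NUL byte for file contents) followed by the content. The first two hex digits become the directory name under objects/ and the rest the file name. A minimal Python sketch, assuming a loose (not packed) blob; `git_blob_id` and `loose_object_bytes` are names I made up:

```python
import hashlib
import zlib

def git_blob_id(content: bytes) -> str:
    """SHA-1 of 'blob <size>', a NUL byte, then the raw file content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_bytes(content: bytes) -> bytes:
    """Roughly what sits on disk for a loose blob: the same bytes, zlib-compressed."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return zlib.compress(header + content)

content = b"some file contents\n"
on_disk = loose_object_bytes(content)

# What running sha1sum on the object file effectively hashes: compressed bytes.
naive = hashlib.sha1(on_disk).hexdigest()

# Decompress first, then hash: this matches the object's name.
real = hashlib.sha1(zlib.decompress(on_disk)).hexdigest()

print(naive == real)                  # False
print(real == git_blob_id(content))   # True
print(git_blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The compression is also why `file` can't make sense of the objects. A commit's hash is computed the same way over the commit object, which names a tree (a full snapshot) plus its parent(s) and metadata, rather than over a delta.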

>One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).

What sequence of operations would change a history item in that way?

You don't actually know how git implements it, so how can you disagree with it?

There is no such thing as "an edited version" of a commit. A commit is identified by a SHA-1 hash of its contents. If you change one bit you get a new commit.

You're a C programmer, right? If someone gave you a specification for writing a program to implement git, without telling you what it was, you'd tell them it would take 2 weeks. And that's because you'd reckon it would take 2 hours to knock out a rough version and a couple of days to clean it up.

Seriously, it's that simple. Just go learn how it works.

I understand how it works. Of course there's such a thing as "an edited version" of a commit: it's a new commit that you create by taking an existing one and altering it. If you want to argue about terminology, please be my guest, but that's all your dispute is.
If you know how it works then where did your last question come from? The bit you're "confused" about?

It's obvious what the answer is if you know how it works, so what was your point exactly?

I know how git works in general. I wasn't 100% clear on the whole garbage collection aspect of it, which is hardly a central feature.

There's a difference between "has no idea how it works" and "understands the overall structure but doesn't know every single detail".