| HN Mirror


by btschaegg 2092 days ago

On 1.:

As others already noted, Git has a GC-mechanism, which means that objects can still linger around in any copy of your repo for a while. And if you need to version binary files, you'd better use git-lfs or git-annex. Obviously, if you don't need them, just nuking them outright with BFG or `git filter-branch` is fine, too.

If you'd like to try git-lfs: It also includes tooling to retroactively migrate your repository[1], but that will rewrite your history (although BFG obviously does that, too).

If GCing your repo does not reduce its size, you'll probably have to hunt down any remnant branch and/or tag that still references the old history and thus "keeps it alive".
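A rough sketch of how that hunt could look, assuming you run it inside the repo and really do want unreachable objects gone right away. The function names are made up for illustration; the Git commands themselves are standard.

```shell
# List every ref (branches, tags, remote-tracking refs, stashes, ...) that
# can still keep old history alive.
list_all_refs() {
    git for-each-ref --format='%(objectname:short) %(refname)'
}

# Drop all reflog entries and prune unreachable objects immediately.
# Destructive: there is no easy way back to the old commits afterwards.
prune_unreachable() {
    git reflog expire --expire=now --all &&
    git gc --quiet --prune=now
}

# Usage (inside the repo):
#   list_all_refs          # look for leftovers, e.g. old tags or refs/original/*
#   prune_unreachable
#   git count-objects -vH  # compare "size-pack" before and after
```

Keep in mind that this only cleans the copy you run it in; every other clone (and the server) has its own objects and its own GC schedule.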

On 2.:

I'm not entirely certain I understand you correctly. Which commits were duplicated? The resulting merge commits?

Assuming that, if you wanted to, you could probably build some shell script to get rid of them (or use something like `git checkout prod; git rebase -ir <first_commit>` and remove the duplicated merges yourself).

But from a repo perspective, this shouldn't cause too much trouble (i.e. the additional space required will be negligible), and doing so would, again, mean that you'll rewrite history, potentially causing issues for others who still have local copies referencing your old commits.

Also, if you go the rebase route: Make sure you understand the todo list Git will generate for you. With `-r`, it preserves merges, but it represents them with `label`, `reset` and `merge` instructions, which is not very intuitive, and you'll have to wrap your head around that first.

You could also try to achieve the same result with `git filter-branch` and `--commit-filter`. In this case, you'd probably want to write a script that only performs the `git commit-tree` command if the tree ID passed to the filter is not the same tree as the first parent commit was referencing already (this should weed out all commits that don't change anything).
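A minimal sketch of such a commit filter, under a few assumptions: the branch name `prod` is just a default, the function name is hypothetical, and it relies on `git filter-branch` calling the filter with `<tree> [-p <parent>]...`, so `$3` is the (already rewritten) first parent. Instead of creating a commit, it prints the first parent's ID, which maps the no-op commit onto that parent.

```shell
# Rewrite a branch, dropping every commit whose tree is identical to the
# tree of its first parent (i.e. commits that change nothing).
drop_noop_commits() {
    branch=${1:-prod}   # hypothetical default branch name
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --commit-filter '
        tree=$1
        # Root commits have no parents ($# is 1) and are always kept.
        if [ "$#" -ge 3 ] && [ "$(git rev-parse "$3^{tree}")" = "$tree" ]; then
            echo "$3"               # reuse the first parent, skip this commit
        else
            git commit-tree "$@"    # commit message arrives on stdin
        fi
    ' -- "$branch"
}

# Usage (on a fresh clone, with a clean working tree):
#   drop_noop_commits prod
```

Note that dropping a no-op merge this way also drops whatever was only reachable through its other parents, so compare `git shortlog -sn` before and after. Git's own documentation nowadays recommends `git filter-repo` over `git filter-branch`, so treat this as an experiment on a throwaway clone first.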

[1]: https://github.com/git-lfs/git-lfs/blob/master/docs/man/git-...

Edit: As an idea for 2.:

Describing issues in commit histories as prose text is tricky :). If your problem looks different than I assumed, you could try to create a bogus Git repo that showcases the pattern in its history and put it on GitHub as a reference.

You can create "empty" commits for that via `git commit --allow-empty`.
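Something along these lines could serve as a starting point for such a showcase repo. It is only a sketch: the directory name, the identity and the `prod`/`dev` branch names are placeholders, and the history it builds (a few empty commits on `dev`, merged into `prod` without fast-forwarding) is my guess at the workflow, not a reproduction of your actual problem.

```shell
# Build a throwaway repo whose history mimics a dev -> prod merge workflow.
# Runs in a subshell so the caller's working directory is left alone.
make_demo_repo() (
    dir=${1:-demo-repo}                       # placeholder directory name
    git init -q "$dir" && cd "$dir" || exit 1
    git config user.name "Demo User"          # local identity, demo only
    git config user.email "demo@example.com"
    git checkout -q -b prod
    git commit -q --allow-empty -m "initial"
    git checkout -q -b dev
    for i in 1 2 3; do
        git commit -q --allow-empty -m "work $i"
    done
    git checkout -q prod
    git merge -q --no-ff --no-edit dev        # always create a merge commit
)

# Usage:
#   make_demo_repo /tmp/demo
#   git -C /tmp/demo log --graph --oneline --all
```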


btschaegg 2092 days ago

Something I just thought of:

If you've GCd your repo and you're sure there are no references to old commits lying around (also keep in mind remote branches and so on!), this might help you discover large objects that are still in your "new" history:

If run in bash, this should print the 20 largest objects in your repo and their size (in bytes):

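  # Collect every blob ID reachable from any ref, then print "<id> <size>".
  git rev-list --all \
  | xargs -n1 git ls-tree -r \
  | awk '$2 == "blob" { print $3 }' \
  | sort -u \
  | while IFS= read -r blob; do
      # `git cat-file -s` prints the object size in bytes
      echo "$blob $(git cat-file -s "$blob")"
  done \
  | sort -rnk2,2 \
  | head -20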

If you find some blob that is too large, you could then search for its file name (and every commit tree it appears in) like this:

  large_blob=<blob_id>

  git rev-list --all \
  | xargs -n1 git ls-tree -r \
  | grep -F "$large_blob"
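On bigger repos the per-commit `ls-tree` loop can get slow. As an alternative sketch (not what I ran above), `git rev-list --objects` combined with `git cat-file --batch-check` gives sizes and paths in a single pass. The function name and the default of 20 are arbitrary choices.

```shell
# Print the N largest blobs reachable from any ref as
# "blob <id> <size-in-bytes> <path>", largest first.
largest_blobs() {
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    sort -k3,3nr |
    head -n "${1:-20}"
}

# Usage (inside the repo):
#   largest_blobs 10
```

`%(rest)` is whatever followed the object ID on the input line, which for `rev-list --objects` is the path the object was found under.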


notafraudster 2092 days ago

RE #2:

(First, thanks for everything!)

Our regular routine at the time: Imagine production and dev are even. We add 5 regular commits to dev. We PR merge dev -> production. We do not do squash and merge. We do not delete dev. Dev and master are now even but master has the merge commit as well.

Flash forward several years. Looking at history, we now see each of those 5 commits for a run of about a year appear multiple times (at the peak, 5 times). The exact same commit, same commit message, etc. This is supposed to be impossible with git and we don't know what caused it, but assume that it was a user error either trying to resolve #1 or just by people who didn't know how to merge merging. Whatever error caused it clearly occurred a few times because for a while the commits appear 5 times, then 4, then 3, then 2, then back to normal, so I suspect whatever the cause, it happened 4 times, duplicating a range of commits each time.

We'd like to go back and de-duplicate, keeping the whole history that led us to now but only having 1 of every commit. I don't fully care about branch history. I just care about blame and about commit counts for all the people who worked on the project. I am fine with the commit counts falling with the duplicate commits removed, but just not falling to 0 like it would if we started from scratch.

We are not worried about local copies (it's a collaborators-only repo that there'd be no reason for uninvolved people to fork) and right now we have no collaborators since the project is grant funded and we don't have an active grant.

That's half the reason I'm trying to clean up these nuisance issues now before other people are involved again.


btschaegg 2092 days ago

No problem. Glad to be of any help :)

I'm sadly still not all that sure I understand you correctly (and I'm afraid I can't tell you exactly what caused your issue), but:

If it is the commits that "do the work" that are duplicated (not the merges), I'd guess that someone (or some script/tool) already rewrote history a couple of times and not everybody was aware of that.

If your team didn't look much at the history structure and try to actively shape it in a certain way, this could just have happened because someone did an innocent `git pull` after the history of the branch had been altered on the server.

`git pull` by default equates to `git fetch` followed by `git merge`, so if your history was altered, the branch would contain copies of the original commits (with totally new IDs), and git would "knit" the two copies together in a new merge. That means that this has probably happened around four times (since you see five copies of the oldest commits).

In this case, the hashes of the duplicated commits should be different. If that's not the case, I'd guess you have a client that visualizes the history in a weird way and the problem is something else.
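To check that, something like the following sketch might help. It assumes that rewritten copies kept their author, author date and subject line (which is what rebase and cherry-pick normally do); the function names are made up.

```shell
# List non-merge commits whose author, author date and subject occur more
# than once in the history of a ref (default: HEAD), prefixed by the count.
find_duplicate_commits() {
    git log --no-merges --format='%an%x09%ad%x09%s' "${1:-HEAD}" |
    sort | uniq -c | awk '$1 > 1'
}

# Show the hashes behind one duplicated subject line, to confirm they differ.
hashes_for_subject() {
    git log --no-merges --fixed-strings --grep="$1" --format='%H %ad %s' "${2:-HEAD}"
}

# Usage (inside the repo):
#   find_duplicate_commits prod
#   hashes_for_subject "Fix typo in README" prod
```

If `hashes_for_subject` prints the same hash only once for a commit your client shows several times, the duplication is probably a display issue rather than something in the history itself.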

Cleaning this up might then be more cumbersome, since the points you'd need to "adjust" weren't committed "back to back". If you're willing to do the work by hand, you could just string together intact pieces of the history into a clean one by using `git rebase` with `-r` and `--onto` (or `git cherry-pick`, but I've never used that one for complex topologies, so I don't know how helpful it is there)...


btschaegg 2092 days ago

I just did some experiments, and I think this should help identify such merge commits from the commit IDs of two copies:

Run this in bash (if you're using Windows, use "Git Bash"). Make sure you replace the `<hash>` tokens with the appropriate commit IDs. The hash it prints should be the merge commit tying the two copies together.

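  commit_1=<hash>
  commit_2=<hash>

  # First input (NR == FNR): remember every commit on the path commit_1..HEAD.
  # Second input: keep the last (oldest) commit that is also on that path.
  awk '
          NR == FNR { h[$1]++; next }
          h[$1] { last = $1 }
          END { print last }
      ' \
      <(git log --ancestry-path --format=%H "$commit_1"..HEAD) \
      <(git log --ancestry-path --format=%H "$commit_2"..HEAD)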
