Hacker News
by rdubz 3313 days ago
A problem with rebase workflows that I don't see addressed (here or in the replies) is: if I have, say, 20 local commits and am rebasing them on top of some upstream, I have to fix conflicts up to 20 times; in general I will have to stop to fix conflicts at least as many times as I would have to while merging (namely 0 or 1 times).
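To make the "up to 20 times vs. 0 or 1 times" count concrete, here is a rough sketch. It assumes a deliberately crude model that is not how git detects conflicts: each commit is just the set of files it touches, and a replayed commit "conflicts" whenever it touches a file upstream also changed. The file names are hypothetical.

```python
def rebase_stops(local_commits, upstream_files):
    """Count replayed commits that would stop for conflict resolution.

    Crude model: a commit conflicts if it touches any file upstream changed.
    """
    return sum(1 for files in local_commits if files & upstream_files)


def merge_stops(local_commits, upstream_files):
    """A merge resolves everything in one step, so it stops 0 or 1 times."""
    touched = set().union(*local_commits)
    return 1 if touched & upstream_files else 0


# Hypothetical branch: 20 local commits, each of which edits parser.py.
local = [{"parser.py", f"test_{i}.py"} for i in range(20)]
upstream = {"parser.py", "README"}

print(rebase_stops(local, upstream))  # 20
print(merge_stops(local, upstream))   # 1
```

In practice, resolving an early conflict often makes later commits apply cleanly, so 20 is the worst case, not the typical one.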

Moreover, resolution work during a rebase creates a fake history that does not reflect how the work was actually done, which is antithetical to the spirit of version control, in a sense.

A result of this is the loss of any ability to distinguish between bugs introduced in the original code (pre-rebase) vs. bugs introduced while resolving conflicts (which are arguably more likely in the rebase case since the total amount of conflict-resolving can be greater).

It comes down to Resolution Work is Real Work: your code is different before and after resolution (possibly in ways you didn't intend!), and rebasing to keep the illusion of a total ordering of commits is a somewhat outdated use (arguably a misuse) of the abstractions we now have available, which can understand projects' evolution in a more sophisticated way.

I was a dedicated rebaser for many years but have since decided that merging is superior, though we're still at the early stages of having sufficient tooling and awareness to properly leverage the more powerful "merge" abstraction, imho.

2 comments

Well, git rerere helps here, though, honestly, this never happens to me even when I have 20 commits. Also, this is what you want, as it makes your commits easier to understand by others. Otherwise, with thousands of developers your merge graph is going to be a pile of incomprehensible spaghetti, and good luck cherry-picking commits into old release patch branches!
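For anyone who hasn't used it: rerere ("reuse recorded resolution") remembers how you resolved a conflict and reapplies that resolution when the same conflict shows up again. The sketch below is only a toy model of that idea, not git's implementation: the `ResolutionCache` class and its key scheme are made up for illustration. In real git the feature is turned on with `git config rerere.enabled true`.

```python
import hashlib


class ResolutionCache:
    """Toy model of 'reuse recorded resolution' (hypothetical, not git's code)."""

    def __init__(self):
        self._resolutions = {}

    @staticmethod
    def _key(ours, theirs):
        # Identify a conflict by the two competing versions of the hunk.
        return hashlib.sha1(f"{ours}\0{theirs}".encode()).hexdigest()

    def record(self, ours, theirs, resolved):
        """Remember how this conflict was resolved by hand."""
        self._resolutions[self._key(ours, theirs)] = resolved

    def replay(self, ours, theirs):
        """Return the remembered resolution, or None if the conflict is new."""
        return self._resolutions.get(self._key(ours, theirs))


cache = ResolutionCache()
cache.record("timeout = 30", "timeout = 60", "timeout = 60")

print(cache.replay("timeout = 30", "timeout = 60"))  # timeout = 60
print(cache.replay("retries = 3", "retries = 5"))    # None
```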

Ah, right, that's another reason to rebase: because your history is clean, linear, and merge-free, it makes it easier to pick commits from the mainline into release maintenance branches.

The "fake history" argument is no good. Who wants to see your "fix typo" commits if you never pushed code that needed them in the first place? I truly don't care how you worked your commits. I only care about the end result. Besides, if you have thousands of developers, each on a branch, each merging, then the upstream history will have an incomprehensible (i.e., _useless_) merge graph. History needs to be useful to those who will need it. Keep it clean to make it easier on them.

Rebase _is_ the "more powerful merge abstraction", IMO.

rebase : centralized repo :: merge : decentralized repo

rebase : linked-list :: merge : DAG

If the work/repo is truly distributed and there isn't a single permanently-authoritative repo, a "clean, linear" history is nonsensical to even try to reason about.

In all cases it is a crutch: useful (and nice, and sufficient!) in simple settings, but restricting/misleading in more complex ones (to the point of causing many developers to not see the negative space).

You can get very far thinking of a project as a linked list, but there is a lot to be gained from being able to work effectively with DAGs when a more complex model would better fit the reality being modeled.

It's harder to grok the DAG world because the tooling is less mature, the abstractions are more complex (and powerful!), and almost all the time and money up to now has gone into exploring the hub-and-spoke model.

In many areas of technology, however, better tooling and socialization around moving from linked-lists (and even trees) to DAGs is going to unlock more advanced capabilities.

Final point: rebasing is just glorified cherry-picking. Cherry-picking definitely also has a role in a merge-focused/less-centralized world, but merges add something totally new on top of cherry-picking, which rebase does not.

As @zeckalpha says, rebase != centralized repo.

You can have a hierarchical repo system (as we did at Sun).

Or you can have multiple hierarchies, contributing different series of rebased patches up the chain in each hierarchy.

Another possibility is that you are not contributing patches upstream but still have multiple upstreams. Even in this case your best bet is as follows: drop your local patches (save them in a branch), merge one of the upstreams, merge the other, re-apply (cherry-pick, rebase) your commits on top of the new merged head. This is nice because it lets you merge just the upstreams first, then your commits, and you're always left in a situation where your commits are easy to ID: they're the ones on top.

I'm the guy who started this DAG model (also at Sun with NSElite and then later with BitKeeper).

I agree that rebase == centralized. It's a math thing. If you rebase and someone has a clone of your work from before the rebase, chaos happens when the two come together. So you have to enforce a centralized flow to make it work in all cases. It's pretty much provable, as in a math proof.

Not true! At Sun we did this with project gates regularly. The way it works (as I've described several times in this thread now) is that you rebase --onto. That is, you use a tag for the pre-rebase project upstream to find the merge base for your branch, then cherry-pick your commits (i.e., all local commits after the merge base) onto the post-rebase project upstream.
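A minimal sketch of that procedure, assuming a single-parent history stored as a hypothetical `parents` dict and made-up commit names. It mirrors what `git rebase --onto <new-upstream> <old-upstream-tag> <branch>` does: find the merge base via the old tag, then replay only the local commits onto the rewritten upstream.

```python
def ancestors(parents, tip):
    """Return [tip, parent, grandparent, ...] for a single-parent history."""
    chain = []
    while tip is not None:
        chain.append(tip)
        tip = parents.get(tip)
    return chain


def rebase_onto(parents, new_upstream, old_upstream_tag, branch_tip):
    """Replay commits that are on the branch but not in the old upstream.

    Adds the replayed commits to `parents` and returns the new branch tip.
    """
    old = set(ancestors(parents, old_upstream_tag))
    local = []
    for commit in ancestors(parents, branch_tip):
        if commit in old:  # reached the merge base
            break
        local.append(commit)
    local.reverse()  # replay oldest first

    tip = new_upstream
    for commit in local:
        replayed = commit + "'"  # a cherry-pick yields a new commit
        parents[replayed] = tip
        tip = replayed
    return tip


# Old upstream A-B-X (tagged at X); upstream was rewritten to A-B-Y.
# The local branch L1-L2 was built on top of X.
parents = {"A": None, "B": "A", "X": "B", "Y": "B", "L1": "X", "L2": "L1"}

new_tip = rebase_onto(parents, new_upstream="Y", old_upstream_tag="X", branch_tip="L2")
print(ancestors(parents, new_tip))  # ["L2'", "L1'", 'Y', 'B', 'A']
```

The eliminated commit X never reaches the rebased branch, which is the point of keeping a tag on the pre-rebase upstream.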

Now, you don't want to do this with the ultimate upstream, though occasionally it happened at Sun with the OS/Net gate, usually due to some toxic commit that was best eliminated from the history rather than reverted, or through some accident.

But you'd be right to say that the Sun model was centralized in that there was just one ultimate upstream. (There was one per-"consolidation", since Solaris was broken up into multiple parts like that, but whatever, the point stands.)

Whereas with Linux, say, one might have multiple kernel gates kept by different gatekeepers. Still, if you're contributing to more than one of them, it's easier to cherry-pick (rebase!) your commits onto each upstream than to just merge your way around -- IMO. I.e., you can have a Linux-kernel-like decentralized dev model and still rebase.

However, as you can see from my comment in the previous paragraph, _rebase_ itself does not imply a centralized model.

I get that you can work around the problems; you don't seem to get that, from a math point of view, rebase forces either

a) a centralized model

or

b) you have to throw away any work based on the dag before the rebase

or

c) you have the history in the graph twice (which causes no end of problems).

(a) is the math way; (b) and (c) are ad-hoc hacks. You are well into the ad-hoc hacks: you've found a way to make it work, but it includes "don't do that" warnings to users. My experience is that you don't want to have workflows that include "don't do that". Users will do that.
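Case (c) can be illustrated with a small sketch. It assumes a simplified commit ID that hashes only the parent ID and the patch text (real systems hash more than that), and the patch names are hypothetical. Because the ID depends on the parent, replaying the same patches on a new base gives them new identities, so a clone made before the rebase and the rebased line together hold every patch twice.

```python
import hashlib


def commit_id(parent, patch):
    """Simplified commit identity: a hash over the parent ID and the patch."""
    return hashlib.sha1(f"{parent}:{patch}".encode()).hexdigest()


def build(base, patches):
    """Apply patches in order on top of base; return a list of (id, patch)."""
    history, parent = [], base
    for patch in patches:
        parent = commit_id(parent, patch)
        history.append((parent, patch))
    return history


patches = ["add parser", "fix parser edge case"]
original = build("old-base", patches)  # what a collaborator cloned earlier
rebased = build("new-base", patches)   # the same patches after a rebase

# Bringing both lines together keeps every distinct commit ID.
combined = dict(original + rebased)

print(len(combined))                # 4 commits in the graph
print(len(set(combined.values())))  # 2 distinct patches
```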

Also, it's harder to grok merge history because we humans have a hard time with complexity, and merge history in a system with thousands of developers and multiple upstreams can get insanely complex. The only way to cut through that complexity is to make sure that each upstream ends up with linear history -- that is: to rebase downstreams.

Nope, you want what I called the event stack. It lets you have your cake and eat it too.

The event stack is a record of every tip that was ever present in this repo other than unpushed commits.

You were at cset 1234, you pull in 25 csets, the event stack has two events, 1 which points to 1234 and 2 which points at the tip after the pull.

You commit "wacked the crap out of it", then commit "fixed typo", then commit "added test", then commit $whatever. The event stack is

1, 2, "." where "." points at your current tip but is floating

Now you push. Your event stack is 1, 2, 3 and 3 points at the tip as of your push.

What about clone? You get your parent's event stack but other than that they are per repo.

The event stack is the linear history you want, it is the view that everyone wants. It answers "what is the list of tips I care about in this repo?". Have a push that broke your tree but you don't know what the previous tip was because the push pushed 2500 commits? No problem. The event stack is a stack and there is a "pop" command that pops off the last change to the event stack. So you would just do "git pop" and see if that fixes your tree, repeat until it does.
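A minimal sketch of the event stack as described above. Everything here is hypothetical, since no such feature ships in git or BitKeeper: the `EventStack` class, its method names, and the tip IDs are invented, and it assumes there are no unpushed commits at the moment of a pull.

```python
class EventStack:
    """Per-repo record of every pulled or pushed tip (hypothetical design)."""

    def __init__(self, tip):
        self.events = [tip]   # event 1: the tip we started at
        self.floating = None  # tip of unpushed commits, the "." entry

    def pull(self, new_tip):
        """A pull records the resulting tip as a new event."""
        self.events.append(new_tip)

    def commit(self, new_tip):
        """Local commits only move the floating entry."""
        self.floating = new_tip

    def push(self):
        """A push pins the floating tip as a new event."""
        if self.floating is not None:
            self.events.append(self.floating)
            self.floating = None

    def pop(self):
        """Drop the latest event and return the tip to roll back to."""
        if len(self.events) < 2:
            raise IndexError("nothing to pop")
        self.events.pop()
        return self.events[-1]


stack = EventStack("1234")
stack.pull("1259")  # pulled 25 csets
for tip in ["1260", "1261", "1262", "1263"]:
    stack.commit(tip)  # four local commits, still one floating entry
stack.push()

print(stack.events)  # ['1234', '1259', '1263']
print(stack.pop())   # 1259
```

However many commits a pull or push carries, it shows up as one entry, so rolling back a bad push is a single pop.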

We never built this in BitKeeper but I should try. If for no other reason than to show people you can have the messy (but historically accurate) history under the covers but have a linear view that is pleasant for humans.

Yes, I've been asking for branch history (the reflog provides some, but it's insufficient because it's not shared in any way).

Even with this, I'd want to rebase away "fixed typo" prior to pushing, and more, I'd want to:

- organize commits into logical chunks so that they might be cherry-picked (in the literal sense, not just the VCS sense) into maintenance release branches

- organize commits as the upstream prefers (some prefer to see test updates in separate commits)

IIUC BitKeeper does have a sort of branch push history, unlike git. Is this wrong?

So the current BK doesn't really have branches, it has the model that if you want to branch you clone, each clone is a branch.

Which begs the question "how do you do dev vs stable branches?" And the answer is that we have a central clone called "dev" and a central clone called "stable". In our case we have work:/home/bk/stable and work:/home/bk/dev. User repos are in work:/home/bk/$USER/dev-feature1 and work:/home/bk/$USER/stable-bugfix123.

We run a bkd in work:/home so our urls are

    bk://work/dev
    bk://work/$USER/dev-feature1

BK has a concept of a level - you can't push from a higher level to a lower level. So stable would be level 1, dev would be level 2. Levels propagate on clone so when you do

    bk clone bk://work/dev dev-feature2

and then try to do

    bk push bk://work/stable

it will tell you that you can't push to a lower level. This prevents backflow of all the new feature work into your stable tree.
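The level rule can be sketched in a few lines. This models only the rule as stated here, not BK's actual code: the `Repo` class and the error message are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    name: str
    level: int

    def clone(self, name):
        """Levels propagate on clone."""
        return Repo(name, self.level)

    def push(self, target):
        """Refuse to push from a higher level to a lower one."""
        if self.level > target.level:
            raise PermissionError(
                f"cannot push from level {self.level} to level {target.level}"
            )
        return f"pushed {self.name} -> {target.name}"


stable = Repo("stable", level=1)
dev = Repo("dev", level=2)
feature = dev.clone("dev-feature2")

print(feature.push(dev))  # pushed dev-feature2 -> dev
try:
    feature.push(stable)
except PermissionError as err:
    print(err)  # cannot push from level 2 to level 1
```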

The model works well until you have huge (like 10GB and bigger) repos. At that point you really want branches because you don't want to clone 10GB to do a bugfix.

Though we addressed that problem, to some extent, by having nested collections (think submodules that actually support all workflows; unlike git's, they are submodules that work). So you can clone the subset you need to do your bugfix.

But yeah, there are cases where "a branch is a clone" just doesn't scale, no question. But where it does work it's a super simple and pleasant model.

Decentralization at scale can result in a linear chain, too.

IMO, VC comes down not to tracking what was actually done, but to creating snapshots of logical steps that are reasonable to roll back to and git bisect with.

And cherry-pick onto release maintenance branches.
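On the bisect point: a linear chain of logical snapshots is what makes a binary search for the first bad commit straightforward. A minimal sketch, assuming the first commit is good, the last is bad, and the bug persists once introduced; the commit names and the `is_bad` check are hypothetical stand-ins for a real test run.

```python
def bisect(commits, is_bad):
    """Return the first bad commit in a linear history listed oldest first.

    Assumes commits[0] is good, commits[-1] is bad, and every commit after
    the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]


history = [f"c{i}" for i in range(16)]
tested = []


def is_bad(commit):
    tested.append(commit)         # record each commit that gets tested
    return int(commit[1:]) >= 11  # the bug landed in c11


print(bisect(history, is_bad))  # c11
print(len(tested))              # 4 tests for 16 commits
```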