Hacker News
by bagrow 2716 days ago
Having taught git several times within a data science course I find two concepts especially worth extra time: WHY there is a staging area, and what is the difference between “git” and “github”.

> WHY there is a staging area

I understand your second point, but I have a hard time understanding the difficulty with this part. Why is it hard for people to understand the idea of staging?

You put things in a box one at a time before closing the box. Does it require more explanation than that? What do people find difficult about it?

People are very used to the web "save always" style: There is one document, and you're editing it. Most people will be familiar with the traditional desktop "save" model where you have to do something to make your changes permanent.

People often then learn that there is a local file and some remote file: they can cope with a save -> upload workflow. Lots of traditional VCSes turn this into a save -> commit workflow.

Git adds two stages to this that people can't see the need for without understanding the internals: an extra step between save and commit, and an extra step after commit.

(The discussion reminds me of all those people who think that if they just start by talking about monads then people will find Haskell easy and natural...)

There are a whole bunch of layers now, though they're all useful.

1. Is my document saved?

2. Are the changes staged?

3. Are the changes committed?

4. Are the changes pushed to my fork on e.g. github?

5. Are the changes merged into the upstream repository on e.g. github?
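
Layers 2 to 4 can each be checked from the command line. The following is only a rough sketch: it assumes a throwaway repository in a temp directory, with a local bare repository (the hypothetical `fork.git`) standing in for a fork on a hosting service, and a made-up file name. Layer 1 is the editor's business, and layer 5 is decided on the hosting side, so neither appears here.

```shell
set -eu
tmp=$(mktemp -d)
git init -q --bare "$tmp/fork.git"   # hypothetical stand-in for "my fork on e.g. github"
git init -q "$tmp/work"
cd "$tmp/work"
git config user.name "Example"
git config user.email "example@example.com"
git remote add origin "$tmp/fork.git"

echo "first" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git push -q -u origin HEAD           # publish the branch and set its upstream

echo "second" >> notes.txt           # 1. saved to disk
unstaged=$(git diff --name-only)     # files saved but not yet staged

git add notes.txt                    # 2. staged
staged=$(git diff --cached --name-only)   # files staged but not yet committed

git commit -q -m "Extend notes"      # 3. committed
unpushed=$(git rev-list --count '@{u}..HEAD')        # commits the fork lacks

git push -q                          # 4. pushed to the fork
still_unpushed=$(git rev-list --count '@{u}..HEAD')

echo "unstaged=$unstaged staged=$staged unpushed=$unpushed still_unpushed=$still_unpushed"
```

Each question in the list maps to one query: `git diff` for unstaged work, `git diff --cached` for staged work, and a comparison against the upstream branch for unpushed commits.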

They don't need to understand the internals for this: just knowing that every save you do will be stored forever as-is makes you double-think about what you put inside
So I have a solid mental model of git, and I understand the theoretical need for the staging area.

However, I find the occasions for using the staging area in practice are few and far between, for the simple reason that I can't test and execute the code that's in the staging area without the code from the working directory also being there. It feels like after having partially staged some of my working directory, it would be a blind commit with no guarantee that things are working.

Very rare is the situation that I can break out a list of files over here that are for feature A and some over there for feature B, and never the two shall interact.

I think this is probably what most struggle with regarding the staging area, without being able to articulate it.

I use it quite a lot, especially with `git add -p` to stage only parts of a file for an atomic commit.
I second this. It wasn't until I adopted this practice that the staging area really made sense to me. I find it helpful not just for making atomic commits, but as a way of remembering what I was actually doing, so that I can write a good commit statement.
This has never made sense to me. I've seen others say that they commit only parts of a file. How does this scenario start? Are you working on solving one problem, but then notice some other unrelated issue and fix that too, before committing the first change?
Partly, yes. Or, I'll be working on a task overall, and have to touch multiple files in the process. Then when I'm ready to commit, I review all the modified files on disk, and look for ways to break those down into smaller discrete logical changes. I prefer to avoid "big bang" commits as much as possible, because smaller individual commits are easier to inspect, easier to back out if necessary, and provide a better "story" when inspecting a file's history sometime down the road.
But then, you either never run/tested those smaller individual commits, or you have to do extra work (stash changes, test, restore stash) to do that.

I do not see why a source control system should make it easier to make a commit that hasn’t ever existed on disk and thus cannot have been tested.

I think the better model would be to stash your changes and have a diff editor between the on-disk working copy and the stashed version that allows you to commit a set of changes as several smaller, more coherent commits.

That wouldn’t guarantee that each of those intermediate commits gets tested or even built, but it would guarantee that each smaller commit is in the on-disk copy at some time.
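
The "stash changes, test, restore stash" loop mentioned above can be sketched roughly like this. It assumes a scratch repository with two made-up files, and uses `git stash push --keep-index` so that only the staged change is left on disk while the tests run. The test step itself is a placeholder comment.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "lib v1" > lib.txt
echo "app v1" > app.txt
git add lib.txt app.txt
git commit -q -m "Initial"

echo "lib v2" > lib.txt            # change meant for the first commit
echo "app v2" > app.txt            # change meant for a later commit

git add lib.txt
git stash push -q --keep-index     # set everything aside, but keep staged content on disk
during_test=$(cat app.txt)         # app.txt is back to its committed state here
# ... run the build or test suite against exactly what will be committed ...
git commit -q -m "Update lib"
git stash pop -q                   # bring the app.txt change back

echo "during test: $during_test, afterwards: $(cat app.txt)"
```

In this simple case, with the two changes in separate files, the pop applies cleanly. Partially staged files can produce conflicts at that step, which is part of the extra work being described.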

One common scenario is that I'm working on one problem, and in the process of solving that issue do some refactoring of related code. In this case, I want to commit the refactoring (which does not change the program's behaviour) before committing the changes that do change the program's behaviour.
I typically then send that first refactoring commit to Github (on its own branch) so that it gets full CI test coverage. And then continue working on the fix/feature while it runs.
One use case is to exclude extra lines of the file you don't want to commit. For example, I might have some debug print statements in my file that I want to keep in my local copy of the file while testing, but I don't want to include in the commit I push up for review.
> Are you working on solving one problem, but then notice some other unrelated issue and fix that too, before committing the first change?

Almost. Most often it's:

- Working on solving problem A
- Notice problem B
- Start to solve problem B
- Notice I'm getting distracted from A, and return to finish it.
- Want to commit my fix for A, but don't want to lose or forget the partial work on B.

Two different approaches I might take in this situation, depending on whether B is related to A.

1. If they are related (eg, B depends on A), use `add --patch` to commit A, then finish and commit B.
2. If unrelated, use `git stash --patch` to stash B, then commit A, then switch to a different branch to finish B.

Honestly, I see the point of both stash and staging, but not both together. Too many tools for the same job. On my long list of projects to do is a git porcelain that combines some of these concepts (eg, stash and working directory which would be tied to a branch):

- Each branch would have a single stash.
- When you check out a new branch, all uncommitted changes are automatically stashed.
- If the branch you're switching to has anything stashed, that stash gets popped.
- Any current workflow that involves stashing can be replicated by using a branch instead of a stash.

This way, branches can be thought of as "state of the working directory", which is more intuitive with the branching tree model, imo; commits are a snapshot of the repo at that point in time; and the staging area is just a way to choose what should be included in those commits.

Amending the last commit does basically the same thing and records each state in the reflog.
You never amend commits or rebase locally before pushing? I rebase before pushing almost every time.

Git’s workflow wouldn’t even be sane without the staging area. This is what allows you to fix mistakes and make your work presentable for remotes.

> Git’s workflow wouldn’t even be sane without the staging area. This is what allows you to fix mistakes and make your work presentable for remotes.

I did exactly the same diff/tidy/diff workflow when I used p4 and svn, neither of which make a distinction between "working directory" and "staging area".

Right, but p4 & svn have “checkout” which is similar to staging. Staging is part of what we get because we can edit files without having to checkout / open for edit.

P4 and svn don’t have a strict commit parentage, which is why you can push commits in those systems in any order. Git’s strict concept of parentage is what makes the staging area so important for keeping your workflow similar to p4 & svn workflows. Without a staging area, you’d either have to always fix mistakes with new commits, which is bad, or rewrite already pushed history, which is worse.

> without having to checkout / open for edit.

The terminology is a bit different - unless configured with mandatory locking (essential for some workflows) you don't have to open for edit. You just edit stuff and it goes in the "default changelist", roughly equivalent to automatic staging.

> Without a staging area, you’d either have to always fix mistakes with new commits

Mistakes at what point? In the normal svn workflow you can review with svn diff, then when you're happy do svn commit; it's just that there's no local place you're committing to. In both cases there's a critical point, either "svn commit" or "git push".

Never, and can never remember what rebase actually means.

At work I’ll hit the squash option on GitLab's merge request, which moots all local machinations.

Judging by the atrocious management of remote history I've seen at workplaces, "making work presentable" is pretty far down the line of priorities
Amending commits and rebasing involve the staging area?
Usually. You can also amend and rebase remote commits, but that’s usually a big no-no.
Committing isn't a commitment. After making the first commit, you can use the `git stash` command to put the rest of your changes aside, and go through the normal test->amend loop until you're happy with that first commit. Then you just retrieve your other changes from the stash to make your second commit.

It's also possible to do this without the stash command, by making both commits right away, and testing them later. However, that would involve rebasing(?) your second commit on top of any changes you end up making to your first commit, so using the stash makes more sense to me personally.
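
The stash-based loop described above (commit, stash the rest, test and amend, then pop) might look something like this. It is a sketch only: it assumes a scratch repository and two made-up files, and the "problem found while testing" is simulated by rewriting one of them.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "fix v1" > fix.txt
echo "feature v1" > feature.txt
git add fix.txt feature.txt
git commit -q -m "Initial"

echo "fix v2" > fix.txt              # work for the first commit
echo "feature v2" > feature.txt      # work for the second commit

git add fix.txt
git commit -q -m "Fix the bug"       # first commit, not yet tested on its own
git stash push -q                    # put the remaining changes aside

# ... test; suppose the fix needs one more tweak ...
echo "fix v3" > fix.txt
git commit -q -a --amend --no-edit   # fold the tweak into the first commit

git stash pop -q                     # retrieve the other changes
git add feature.txt
git commit -q -m "Add the feature"

git log --oneline
```

The history ends up with one commit per logical change, and the intermediate "fix v2" state survives only in the reflog.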

Fwiw, stash can get you into trouble more easily than commit. It’s no more typing to commit or branch, so I recommend preferring those to stash when it makes sense, or when you’re playing with changes you don’t want to lose. Stash is handy for a bunch of things, so use it by all means, just remember that there’s often an equivalent way that is just as easy and much safer.

The git stash man page talks about this: https://git-scm.com/docs/git-stash

“If you mistakenly drop or clear stash entries, they cannot be recovered through the normal safety mechanisms.”

One of the best things about git is how big the safety net is, as long as you tell git about your changes. Almost any mistake can be fixed, so why use features that aren’t sitting over the safety net?

A scenario:

You're adding a feature to your proggie. That involves modifying the main bits to add the feature and, say, adding a couple of interfaces to internal library modules.

Split out the changes to the library modules into separate commits---it's safe because nothing uses them, they're logically separate from the feature changes (although they don't appear to have a justification without the feature), the log will be marginally cleaner, and git bisect will have more granularity.

Why is the staging area needed in such a case? In more traditional systems, you'd just do, say, "svn commit library/" and then commit the rest. (and you could do just the same in git too without seeing the staging area)
Understanding the staging area first requires understanding the need for it: The need for atomic commits. The need to create commits that have specific changes in them and are not always a snapshot of the entire world below the git root exactly as is right now.
Yes, it requires more explanation than that. I've used git for years, and never really understood why staging is even a thing.

Your example is an implementation of the box-putting algorithm, but it doesn't need to be mirrored in the put-box CLI.

    put-close-box file1 file2
This command could encompass all the putting and closing. Since you only close boxes when you are done putting things in it, I don't see a need or purpose to split it up.

    put-box file1 file2
    close-box
A closed box (commit) is always going to contain stuff that was put in it, so why separate commands?
That's not convenient when you're putting things into the box piecemeal, especially with `git add -p`. A thing I do frequently is to run `git diff`, scan through it, and add files (or parts of files) one by one in a second terminal. Then I do a final review of the staging area (with `git diff --cached`) to make sure it only has the changes I want and commit. I'm the sole devops engineer at my company and my workflow is a bit more scattered than a typical developer's.

Anyway, `git commit file1 file2` by itself is most of the way to being the put-close-box function you want; it just doesn't work for adding/deleting files from the repo. Seems like they could make a lot of people happy by closing that gap and letting `git add` be an intermediate-level feature.
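
A quick sketch of that behaviour, assuming a scratch repository and made-up file names: committing by path picks up only the named tracked files, and refuses a file git has never been told about.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "a1" > a.txt
echo "b1" > b.txt
git add a.txt b.txt
git commit -q -m "Initial"

echo "a2" > a.txt
echo "b2" > b.txt
git commit -q -m "Update a only" a.txt    # no `git add` needed for a tracked file
left_over=$(git diff --name-only)         # b.txt is still modified and uncommitted

# The gap: a brand-new file cannot be committed this way.
echo "new" > new.txt
if git commit -q -m "Add new file" new.txt 2>/dev/null; then
  new_file_committed=yes
else
  new_file_committed=no                   # git does not know this path yet
fi

echo "left over: $left_over, new file committed: $new_file_committed"
```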

To me, that ought to be a concern of the "porcelain", although no one uses that word anymore. CLI is particularly bad at certain types of interaction. So to compensate, a mitigation is moved into the underlying model of git. That mitigation is staging. The inconvenience of "piecemeal adding" could have easily been addressed in the UI layer using a more suitable presentation, rather than forcing all clients to follow the stage/commit dichotomy.
For simple projects (like people experimenting with git) you will always want to save all changes. So why stage first?
Not everyone stays a beginner forever, and it's nice to have a tool that doesn't play to the lowest common denominator. It's really not that hard to just do a "git commit -a" if you want to avoid staging.
> Not everyone stays a beginner forever

But the vast majority do, or at best become perpetual intermediates (https://blog.codinghorror.com/defending-perpetual-intermedia...).

99% of developers out there didn't need a power tool for source control (source control is already quite a power tool many devs can barely handle, even in SVN form...), yet here we are: Git is imposed everywhere, with its horrible UX.

Git's UX isn't that bad if you're only cloning projects to build them locally and keep them updated. The UX only gets really crufty as you use more and more of the features.
I think people find it difficult because for most beginners at git, they just want to put everything in the box. Having the option to put just some things in the box seems more complicated than needed. Obviously, as you get better with the tool, you realize the power of literally "staging" your changes into multiple commits, but as a beginner, it's not even in your purview.
My hurdle was 15-20yrs of no staging area from previous VCSes, so it took some time to understand why the extra step was needed.
Isn't the staging area closer to an intermediary box? That's where it can get confusing.
Staging puts things in the box, commit closes the box, puts it on the pile with the other boxes, and gives you a new empty staging box.
But why is it an extra step? It's basically just a "longterm" selection of what you want to commit.
Because you don't always want to put everything in the box (and if you do, there's a shortcut to do it), and "git commit file1 folder/folder/*.cpp folder/folder/*.h ..." for a complex set would be annoying and require you to mentally keep track of it from the beginning.

Many beginners will start by always doing "git commit -a" and that's fine, as long as they know there's an alternative once they need it.

But why is the exceptional case the default?

Surely, most of the time when you go to commit, it's all the files you've changed?

My point was more why staging is a special feature that even has a name. You're basically just selecting what changes you want to commit.

What is the usecase where one needs to remember that selection for more than just a few minutes?

probably related changes grouped together
The staging area is really an extraneous concept that isn't required. It's like a commit that isn't a commit.

In Mercurial, I much prefer to just make it an actual commit in the draft phase (the default phase) and just keep rewriting that commit. Mercurial provides tools for both selectively adding and removing hunks from a commit (both `hg amend` and `hg uncommit` accept --interactive for hunk selection). If you're extra paranoid, you can make it a commit in the secret phase so it's not shared prematurely by accident.

It's pretty much functionally equivalent and doesn't require an extra location in which your code can be. It's either in your working directory or in a commit.

A bonus of this approach is that now you have a meta-history, hidden by default, of what you've "staged" and "unstaged". It's kind of like a reflog but with, in my opinion, a better UI. And of course, the index/cache/staging area in git doesn't use refs, so there's no reflog there.

I've helped move a couple teams (kicking and screaming) from TFS to git, and I start back even further than that - why is it so much more complicated than clicking a button to save and share my work, and what is the benefit of that complication?
I'm very experienced with git, approaching expert level, and I don't use the staging area. I use

    git commit --verbose --patch
and bypass the staging area entirely. I don't find it helpful.