How to represent those snapshots, and fix the storage bloat a naive implementation would cause, is a completely different problem.
One of the things that makes Git smart is that it doesn't try to optimize things prematurely. SVN and co. would store actual diff data, but this made some operations really hard to implement (and, in many cases, slow).
Git has commits conceptually as snapshots. It's up to the storage code to figure out how to deal with this.
> But I find it far more intuitive and useful to think of commits as "diffs + some metadata".
Except that this is not what's happening. I wouldn't even call it an abstraction, it's how things actually work. What you call abstractions are actually operations. If we run a diff we are interested in the changes, but if you ask git to show you the commit it will show you just that.
If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.
> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.
I find that thinking of commits as snapshots is not so useful. I prefer to think of them as a pair of parent commit and diff.
With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.
It's not clear to me how thinking of commits as snapshots helps me to explain operations such as rebase.
I do concede, however, that "git cat" (I think that's the command) seems more closely related to a snapshot: you identify a commit and a file, and it will give you the content of that file at that commit. Clearly in this case the concept of a snapshot works well. But I need this very rarely.
> With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.
You can think of it that way if you want. But it's not what Git actually does.
Personally I much prefer to have my mental model match the actual reality of things.
You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.
What it does in actuality is find the snapshot of that SHA and change the working tree to match it.
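To make the cost difference concrete, here is a toy sketch of the two storage models. It is not how Git lays out objects on disk; the commit names, the `snapshots` and `diffs` dictionaries, and both functions are made up for illustration.

```python
# Snapshot model: every commit records its full tree (path -> content).
snapshots = {
    "c1": {"parent": None, "tree": {"a.txt": "one"}},
    "c2": {"parent": "c1", "tree": {"a.txt": "one", "b.txt": "two"}},
    "c3": {"parent": "c2", "tree": {"a.txt": "uno", "b.txt": "two"}},
}

# Diff model: every commit records only its changes relative to its parent.
# A value of None would mean "this path was deleted".
diffs = {
    "c1": {"parent": None, "changes": {"a.txt": "one"}},
    "c2": {"parent": "c1", "changes": {"b.txt": "two"}},
    "c3": {"parent": "c2", "changes": {"a.txt": "uno"}},
}


def checkout_snapshot(sha):
    """One lookup, regardless of how long the history is."""
    return dict(snapshots[sha]["tree"])


def checkout_by_replay(sha):
    """Walk back to the root commit, then replay every diff in order."""
    chain = []
    while sha is not None:
        chain.append(sha)
        sha = diffs[sha]["parent"]
    tree = {}
    for commit in reversed(chain):
        for path, content in diffs[commit]["changes"].items():
            if content is None:
                tree.pop(path, None)
            else:
                tree[path] = content
    return tree
```

Both functions return the same tree for the same commit. The difference is only in the work done: the replay version grows with the length of the history, the snapshot version does not.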
If git did rebuild the graph, right from the very first commit, the end result of the operation would look identical to the user as it does now.
It seems to me the two mental models are interchangeable when it comes to the use of git from the user's point of view. What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?
Now that I think about it, it's probably that users have a bad understanding of the commit-as-diff model; I expect they could just as easily have a bad understanding of the commit-as-snapshot model. I don't know that thinking in snapshots helps a user understand git better than thinking (properly) in diffs.
The article for example explains that any two commits can be differenced because the underlying snapshot trees can be compared, but the commit-as-diff model can as easily explain why comparing two commits works by tracing each commit back to the common base commit - so the commit-as-diff mental model just needs to remember that commits are fundamentally tied to the path they have back to the root commit.
It seems to me if you take the diagrams from the article and remove the under-the-covers stuff leaving just the circles, the commits-as-diffs and commits-as-snapshots models look exactly the same.
Merge commits are a bit hard to understand from the perspective of "a commit is basically just a parent commit plus diff".
On the flip side, cherry-picking is hard to understand from the perspective of "a commit is basically just a snapshot, nothing more" (it's _also_ weird from the parent-commit-plus-diff perspective -- cherry-pick is kind of a weird operation, but useful enough that we keep it anyway despite it not fitting quite as cleanly into the git model as other operations).
Outside those edge cases, though, people with "snapshot" and "parent + diff" mental models will make basically identical predictions about what the results of various operations with git will be.
> What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?
With the wrong mental model it's harder to predict what operations are expensive. If "git checkout <SHA>" truly did have to replay all diffs from the beginning of time, it would be a very expensive operation that is best avoided unless you absolutely need it. But in practice it is a very fast operation (one of the fastest) that there is no need to shy away from.
A fair point possibly, but given checkout/switching branches is probably just about the most common action when working with git repos, I'd hope people would notice that it's fast pretty quickly.
> You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.
Yes, this is true. I don't know why it never bothers me. Maybe it's because you could also store the diffs in the opposite direction (i.e. store the tip of each branch in the clear, then store diffs from each commit to its parent). Computing the inverse of a diff should be a quick operation. Usually, when you check out something, it's the tip of a branch or near the tip of a branch.
Anyway.
Of course I know that storing trees makes it easy to compute diffs. Computing diffs becomes slower with larger trees. On the other hand, storing diffs makes it slow to compute trees, and the more commits we've got, the slower the tree computation goes.
> Computing diffs becomes slower with larger trees
Not usually. Computing a diff is roughly O(n) in the size of the diff. This is because unchanged leaves of the tree can be seen as identical (because they are content addressed) and are skipped. So to compute the diff you only need to recurse into changed directories.
So if you have a million files in the root directory and one has changed, the diff is very fast, as you just diff that one file. The worst case is a change in a very deeply nested directory with lots of files in each of the subdirectories, but even that is quite cheap, as diffing a sorted directory listing is O(n) in the size of the listing.
(The actual worst case is diffing large files as most text diff algorithms are worse than O(n))
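A rough sketch of why that works, assuming a toy content-addressed store. The `store`, `put`, and `changed_paths` names are hypothetical and the hashing scheme is simplified; it is not Git's real object format.

```python
import hashlib

store = {}  # object id -> bytes (a blob) or dict of name -> object id (a tree)


def put(obj):
    """Store a blob (bytes) or a tree (nested dict) and return its content hash."""
    if isinstance(obj, bytes):
        oid = hashlib.sha1(b"blob " + obj).hexdigest()
        store[oid] = obj
        return oid
    entries = {name: put(child) for name, child in obj.items()}
    listing = "".join(f"{name} {entries[name]}\n" for name in sorted(entries))
    oid = hashlib.sha1(b"tree " + listing.encode()).hexdigest()
    store[oid] = entries
    return oid


def changed_paths(a, b, path=""):
    """Yield paths of files that differ between object ids a and b (None = absent)."""
    if a == b:
        return  # same hash means same content, so the whole subtree is skipped
    obj_a = store.get(a)
    obj_b = store.get(b)
    if isinstance(obj_a, bytes) or isinstance(obj_b, bytes):
        yield path  # a file was added, removed, or modified here
    kids_a = obj_a if isinstance(obj_a, dict) else {}
    kids_b = obj_b if isinstance(obj_b, dict) else {}
    for name in sorted(kids_a.keys() | kids_b.keys()):
        child = f"{path}/{name}" if path else name
        yield from changed_paths(kids_a.get(name), kids_b.get(name), child)


old = put({"src": {"a.py": b"1", "b.py": b"2"}, "docs": {"readme": b"hi"}})
new = put({"src": {"a.py": b"1", "b.py": b"3"}, "docs": {"readme": b"hi"}})
print(list(changed_paths(old, new)))  # ['src/b.py']
```

The `docs` subtree hashes to the same id on both sides, so the walk never looks inside it. Only `src` is opened, and within it only `b.py` is reported.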
> If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff
Well, it would usually be more efficient to figure out where the currently checked-out branch differs from the branch being checked out, and then unapply and apply diffs as needed.
Rebase doesn't work that way, though [0]. It first extracts the 3 versions (2 leaves and their common ancestor) and then does a diff & patch.
This allows git to store the deltas between versions in the most efficient way on disk, while also letting it use contextual diffs to minimize the chance of spurious merge conflicts. Patching algorithms have various heuristics that make sense for programming languages, like special treatment for lines with only changes in whitespace.
(Edited to add:) also, minimal diff algorithms have to do a lot of work to detect large blocks of text being moved around. This is part of what made Subversion, which used the same diff algorithm for storage compression and merging, painfully slow.
Here is the paragraph that describes what rebase does:
> This operation works by going to the common ancestor of the two branches (the one you’re on and the one you’re rebasing onto), getting the diff introduced by each commit of the branch you’re on, saving those diffs to temporary files, resetting the current branch to the same commit as the branch you are rebasing onto, and finally applying each change in turn.
Is "applying the diff to a different parent" not a good way to describe this?
You're using the word 'diff' for 2 different things:
- an efficient way to store 2 very similar files
- the minimal set of changes made by a programmer to a file.
Subversion uses the same diff algorithm for these 2 functions, which is why people conflate them. But git uses different algorithms. The first kind (which it calls deltas) is optimized for speed and compression ratio. The second set of algorithms (you can choose from a few, some of which are better at identifying rearrangements of large blocks of text) is optimized for merging 2 programmers' changes without conflicts.
The way you try to apply a diff to a different parent is by doing a three-way merge... the vast majority of tools do this by taking three files as arguments and producing a fourth as output. The three-way merge is the underlying process which makes merge, rebase, cherry-pick, and revert work. They are all just "three-way merge, shuffle the arguments around, and adjust metadata".
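Here is a minimal sketch of the "shuffle the arguments" idea, working on whole files rather than on lines within a file. Real tools merge at line level and report conflicts differently. The function names and the snapshot-as-dict representation are assumptions made for illustration.

```python
def merge3(base, ours, theirs):
    """File-level three-way merge of snapshots (dict of path -> content).

    Returns (merged_snapshot, conflicting_paths). A missing path means
    the file does not exist in that snapshot.
    """
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:      # both sides agree (including "both deleted")
            result = o
        elif o == b:    # only their side changed it
            result = t
        elif t == b:    # only our side changed it
            result = o
        else:           # both sides changed it differently
            conflicts.append(path)
            result = o  # placeholder: keep ours and flag the conflict
        if result is not None:
            merged[path] = result
    return merged, conflicts


def cherry_pick(head, commit, parent):
    """Apply what `commit` changed relative to its `parent` on top of `head`."""
    return merge3(base=parent, ours=head, theirs=commit)


def revert(head, commit, parent):
    """Undo what `commit` changed: same merge with base and theirs swapped."""
    return merge3(base=commit, ours=head, theirs=parent)
```

Under this sketch, a rebase is just `cherry_pick` applied to each commit of the branch in turn, with the result of one step becoming `head` for the next.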
The parent + diff storage is not isomorphic to snapshot storage. Snapshot storage reflects the actual usage of VCS tools... people make changes, and record the final state. Parent + diff does not do this, it records the changes, which requires creating a diff, and there are multiple ways to create a diff between two snapshots.
Git postpones the "which diff is correct" question until you actually care about the answer.
> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.
I don't think those concepts are distinct as you're painting them. At a user visible level commits will almost always be visualized as diffs, which puts us at a place where - at the highest level and lowest level they're defined as pretty close to diffs, while at an intermediary level they're defined closer to snapshots.
I honestly think they're neither. Each expression method (diff vs. snapshot) can be translated into the other pretty easily, and both are trying to represent the same end goal. It can be helpful to know that commits represent the full state of the codebase at a point in time, but that view can be at odds with merging and rebasing, which use actual change sets in their calculations. When a commit is being manipulated it's helpful to view it as a diff (and git does this), whereas when a commit is being read, we're using it as a snapshot.
One way I like to think about this is that when you rebase a branch, the diffs are the same (barring any conflicts) but the commits are different. Just another reason commits aren't the same as diffs.
The diffs are often different, even without conflicts. Try comparing them some time, and look closely at the diff... look at the lines starting with @. People usually ignore those lines but "patch" does NOT.
This is not an irrelevant detail; it's the result of a three-way merge. The three-way merge can update those @ lines if it has a complete set of inputs (all three inputs). If you make a patch from one branch and then apply it to a different branch without using the three-way merge algorithm (stripping the diff of all its context), the patch may fail to apply even if the three-way merge succeeded without conflicts.
I think this is more a sign that git (porcelain) is not aligned with the underlying model.
It is actually a pity that so little effort went into git UI. I find the OP explanation of git model awesome and the presented concepts beautiful, but the cli utility has countless naming and consistency problems which make me sad that hg didn't win over git. Life would be much simpler for many developers if it did, imho.