You're discounting the idea that someone might want to destructively rewrite the...

mikeash · on Dec 15, 2013

I'm not discounting it, I simply don't agree with how git implements it.

IMO the correct option is to create a new repository that has the same history as the old repository minus the offending commit (or possibly with an edited version of that commit that leaves out the offending string).

Because it creates a new repository, there's no risk of data loss in your old repository. Once you're confident that the operation succeeded, you can swap them.

I haven't had to do this for a long time, but as I recall, this is basically how svn does it. It works fine.

The problem with git is that it makes this far too easy and it works by editing existing repositories rather than creating new ones. So instead of once-in-a-blue-moon repository hacking to get rid of that password you accidentally committed, you get people rewriting history because they think the real history isn't "clean". I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.

Finally, I'm confused about something, so maybe you could clear this up for me. I keep seeing assurances that 1) git does not actually destroy any data, and you can always recover if you screw up and 2) editing history is sometimes a vital necessity for cases like when you commit passwords. You yourself made these assurances in this comment. However, 1 and 2 are obviously mutually exclusive. If you can always recover then you can't actually scrub the repository of accidentally committed passwords and the like. Which one is actually true?

rajivm · on Dec 15, 2013

Re: 1 and 2

1) This is almost true. Anything that is committed to Git is recoverable. When you "re-write" history, Git is creating a new set of commits in the history, an "alternate history path." It does not destroy the original commits, but there is no named reference to them (unless you created a branch/tag pointing to this line of commits).

2) In this case, if you want to actually destroy these unreferenced commits, you must run "git gc". This IS a destructive command. It will remove any unreferenced commits from the repository. (gc = garbage collect). If you never garbage collect, you will always have access to anything that was ever committed. It just might be hard to find since the only reference is the ref-log (if it was recent) or the commit hash.

mikeash · on Dec 15, 2013

Since garbage collection does happen automatically after a while, it seems that the "doesn't destroy data" bit isn't completely true. But I understand that it's a fairly rare case where you're going to screw something up and then not bother to get it back until after garbage collection cleans it up.

Thanks for clarifying that.

sofal · on Dec 15, 2013

I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.

This is no more insane than editing a source code file before you save it to the file system. Git is used as a development tool as well as version control, and developers are therefore encouraged to commit often, even if the code does not actually compile yet. There is no more need to fill the published history with all of these WIP commits than there is for me to know about every goddamn keystroke you made while you were dicking around with that config file.

mtdewcmu · on Dec 16, 2013

Is the history stored as a text file somewhere that you can just edit? I sometimes wish git were a bit more transparent and less of a black box.

pyre · on Dec 16, 2013

I suggest that you pick up any git tutorial out there. It will soon become less of a black box.

mtdewcmu · on Dec 16, 2013

I've read a lot about git. The docs generally don't pick apart what's inside the .git directory.

pyre · on Dec 16, 2013

- The history items are stored as commit objects that are identified as a SHA-1 sum of the contents (including meta-data like Authored By, Committed By, etc).

- One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).

- All of the commit objects are stored under .git/objects.

- Branches are just files under .git/refs/ that contain the SHA-1 sum of the most recent commit on that branch. This is why they are called 'branch pointers.' That's basically all they are.

- If you have a history of 5 commits, and make a change to the initial commit, you now have 10 commits in your .git/ directory. Your (e.g.) 'master' branch will point to the most recent 'tree' of 5 commits. The other commits will still exist in .git/objects, but there will be no branches pointing them. You can use 'git reflog' to find them, or access them by their SHA-1 sum.

- Eventually 'git gc' (gc = garbage collect) will clean out the unreferenced commits, but this happens rarely if you don't explicitly run the command.

- When you 'git push,' you are only pushing branches to the remote repo, so commits that are stored locally, which are not referenced by one of those branches you are pushing, will not be pushed out. If you have commits that you don't want to end up in limbo like this, you should 'git tag' them or create a branch (e.g. 'archive/master-2013-12' that points to them).

mtdewcmu · on Dec 16, 2013

It looks like .git/logs contains the history. It looks like the file format is a space-separated list, with the format "$parentcommitsha1 $newcommitsha1 ... $commitmessage". That's fairly comprehensible. What are the SHA-1 sums of? Are they of the entire snapshot, or the delta? I went into objects/ and ran `sha1sum $objfile`, and the sum did not match the file name. So that remains obscure. `file $objfile` could not identify the format; it gave nonsense.

Thanks for the help.

>One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).

What sequence of operations would change a history item in that way?

pyre · on Dec 17, 2013

> It looks like .git/logs contains the history. It looks like the file format is a space-separated list, with the format "$parentcommitsha1 $newcommitsha1 ... $commitmessage". That's fairly comprehensible.

I've never looked at .git/logs, but it looks like that is used by the `git reflog` command. It's basically a history (or log) of every commit that a particular reference has pointed to[1]. For example, I cloned the git source code:

  user@host ~/src/git % cat .git/logs/HEAD
  0000000000000000000000000000000000000000 d7aced95cd681b761468635f8d2a8b82d7ed26fd First Last <user@example.com> 1387237920 -0500	clone: from https://github.com/git/git.git

  user@host ~/src/git % git reflog
  d7aced9 HEAD@{0}: clone: from https://github.com/git/git.git

Note: `HEAD` is a reference to the current branch. E.g.:

  ~/src/git $ cat .git/HEAD
  ref: refs/heads/master

  ~/src/git $ cat .git/refs/heads/master
  d7aced95cd681b761468635f8d2a8b82d7ed26fd

It's also of note that branches are referred to as 'references' too, hence storing them under `.git/refs/`.

> What are the SHA-1 sums of? Are they of the entire snapshot, or the delta? I went into objects/ and ran `sha1sum $objfile`, and the sum did not match the file name. So that remains obscure.

See: http://stackoverflow.com/questions/5290444/why-does-git-hash...

[1]: Since the local repository was created. This information does not sync between local and remote.

mtdewcmu · on Dec 17, 2013

>I've never looked at .git/logs, but it looks like that is used by the `git reflog` command. It's basically a history (or log) of every commit that a particular reference has pointed to[1]. For example, I cloned the git source code:

I think it's more or less the DAG represented as an adjacency list. I'd have to think a bit about why there is a separate log file for each branch. It seems that there's some redundancy in doing that, and I'm wondering what the advantages and disadvantages are of splitting the history up in that way.

>It's also of note that branches are referred to as 'references' too, hence storing them under `.git/refs/`.

I've developed a loathing of excessive hierarchies/trees, so I'd rather see them flattened in a single directory. But that makes sense.

>See: http://stackoverflow.com/questions/5290444/why-does-git-hash....

That's a good link. What's in an object? If an object corresponds to a commit, then it must aggregate data about changes to multiple files.

pyre · on Dec 17, 2013

> I think it's more or less the DAG represented as an adjacency list. I'd have to think a bit about why there is a separate log file for each branch. It seems that there's some redundancy in doing that, and I'm wondering what the advantages and disadvantages are of splitting the history up in that way.

Think of each branch as a pointer. Then realize that you can make that pointer point anywhere on the DAG, even to parts of the DAG that have no connection to each other. The `reflog` is a (local, non-comprehensive) history of where that pointer has pointed. That's why there is a separate log for each branch. I guess that technically they could have a single log file and add another field to specify the branch, but using the same directory tree structure as under .git/refs/ makes the mental model simpler (and probably a performance improvement not to have to parse the reflog for every branch just to see the reflog for one branch).

> I've developed a loathing of excessive hierarchies/trees, so I'd rather see them flattened in a single directory. But that makes sense.

I'm not sure what branches living under .git/refs has to do with excessive hierarchies/trees. There are enough things stored in the .git directory, that if you mashed them all together it wouldn't make any sense.

> What's in an object?

If you really care to dive deeper, you can check objects here: https://github.com/git/git/blob/master/object.h

You can get a shorter version towards the bottom of the git manpage (e.g. `man git`):

  IDENTIFIER TERMINOLOGY
         <object>
             Indicates the object name for any
             type of object.
  
         <blob>
             Indicates a blob object name.
  
         <tree>
             Indicates a tree object name.
  
         <commit>
             Indicates a commit object name.
  
         <tree-ish>
             Indicates a tree, commit or tag
             object name. A command that takes a
             <tree-ish> argument ultimately wants
             to operate on a <tree> object but
             automatically dereferences <commit>
             and <tag> objects that point at a
             <tree>.
  
         <commit-ish>
             Indicates a commit or tag object
             name. A command that takes a
             <commit-ish> argument ultimately
             wants to operate on a <commit> object
             but automatically dereferences <tag>
             objects that point at a <commit>.
  
         <type>
             Indicates that an object type is
             required. Currently one of: blob,
             tree, commit, or tag.
  
         <file>
             Indicates a filename - almost always
             relative to the root of the tree
             structure GIT_INDEX_FILE describes.

mtdewcmu · on Dec 17, 2013

I noticed that there is no delta compression until objects get incorporated into a pack.

>Think of each branch as a pointer. Then realize that you can make that pointer point anywhere on the DAG, even to parts of the DAG that have no connection to each other. The `reflog` is a (local, non-comprehensive) history of where that pointer has pointed.

I got that branches were pointers. Now that I'm aware that the DAG is fully represented inside objects, I can see that what's inside logs/ is actually just logs. Each log corresponds to a subgraph of the full DAG. Getting history from a log would be more efficient than from the objects themselves, because to get it from objects, you'd have to dereference a lot of object references.

>I'm not sure what branches living under .git/refs has to do with excessive hierarchies/trees. There are enough things stored in the .git directory, that if you mashed them all together it wouldn't make any sense.

Having to descend through layers of subdirectories makes things harder. I'd reduce the depth of the directory tree to the absolute minimum. It's hard to tell if this is the minimum without knowing exactly what all the implementation constraints might have been.

I can see that the real meat of this system is the object store. It's useful to know about `git cat-file` for inspecting it.

pyre · on Dec 18, 2013

> Each log corresponds to a subgraph of the full DAG

I don't have the time to keep up this conversation, but this assertion is wrong. It is not a subgraph. It is a history of the values that the pointer was pointing to (e.g. "Pointer <branch_name> changed from pointing to value AAA to value BBB due to action XXX"). That is basically what all of those entries are. 'AAA' and 'BBB' maybe be in completely unconnected sections of the DAG.

If you create a new repository and add a couple of commits, then yes the reflog files will look like a history, but only because the branch pointer has traversed the DAG from start to end with no deviations.

For example you can have a DAG like this:

   A - B - C - D - E

   X - Y - Z

If you change the branch pointer to move from B to Z, this is not a subgraph. Well, I guess technically you could call it sgraph of the history of the branch pointer, but it in no way corresponds to the DAG other than that all of the pointer values exist within the DAG. For example the following operations:

  git clone
  git reset --hard Z
  git reset --hard X

Would create a graph like this (assuming that master pointed to E when you cloned):

  E - Z - X

Notice that this really don't correspond to the DAG other than the fact that those objects exist in the DAG.

Note:

- All of this information is only contained within the .git/logs files. None of it is stored in the objects themselves.

mtdewcmu · on Dec 17, 2013

I'm guessing that the reason each branch has its own history is probably related to the goal of only appending new entries at the end of things. Since any branch can be under development, they need their own files. It sort of makes sense.

pyre · on Dec 17, 2013

I still think that you're a little confused. The reflog is a "history of where this branch has pointed since the repository was created/cloned." If I clone a repository with a history of 100 commits on the 'master' branch, the reflog for the 'master' branch will only have one entry. You can completely delete the `.git/logs` and still run `git log` successfully.

Here's an example:

  $ git clone blah
  
  DAG:
  
    A - B - C - D - E
        \
         Z - X - Y
  
  
  Branches:
  
   master => E
   topic/new-feature => Y
  
  
  reflog:
  
    master
      E - clone from blah
  
    topic/new-feature
      Y - clone from blah

Notice how cloning a repository with an existing DAG doesn't populate the reflog. It just give it a single entry saying that the branch was updated from 'nothing' to whatever commit it was pointing to remotely.

Now let's change where 'master' is pointing:

  $ git reset master C
  
  
  DAG:
  
    A - B - C - D - E
        \
         Z - X - Y
  
  
  Branches:
  
   master => C
   topic/new-feature => Y
  
  
  reflog:
  
    master
      E - clone from blah
      C - reset to C
  
    topic/new-feature
      Y - clone from blah

Notice how the reflog is a history of the values that the branch was referencing, but is not the history as what you get when you run 'git log'. After the reset, 'git log master' would show you commits A, B and C, but A and B are nowhere in the reflog.

mtdewcmu · on Dec 17, 2013

I see. I started reading the internals chapter at[1]. This free book seems better than the O'Reilly book, which I bought.

So the DAG is actually stored inside objects. The contents of the objects directory could be described by a relational schema, and I think that would make it easier for a lot of people to understand (myself included):

  Blob
  - sha1hash (primary key)
  - contents (blob)

  Tree
  - sha1hash (primary key)

  TreeEntry
  - treeid (foreign key into Tree)
  - mode (mode of blob/subtree)
  - type ("blob" or "tree")
  - objectid (foreign key into Tree or Blob)
  - name

  Commit
  - sha1hash (primary key)
  - tree (foreign key into Tree)
  - parent (foreign key into Commit)
  - author
  - committer
  - comment

The tree entries are actually denormalized and stored as a list inside the tree. You could represent this more accurately with XML. But who likes XML?

[1] http://git-scm.com/book

phaemon · on Dec 15, 2013

You don't actually know how git implements it, so how can you disagree with it?

There is no such thing as "an edited version" of a commit. A commit is identified by a SHA1 hash of its index of contents. If you change one bit you get a new commit.

You're a C programmer, right? If someone gave you a specification for writing a program to implement git, without telling your what it was, you'd tell them it would take 2 weeks. And that's because you'd reckon it would take 2 hours to knock out a rough version and a couple of days to clean it up.

Seriously, it's that simple. Just go learn how it works.

mikeash · on Dec 15, 2013

I understand how it works. Of course there's such thing as "an edited version" of a commit: it's a new commit that you create by taking an existing one and altering it. If you want to argue about terminology, please be my guest, but that's all your dispute is.

phaemon · on Dec 16, 2013

If you know how it works then where did your last question come from? The bit you're "confused" about?

It's obvious what the answer is if you know how it works, so what was your point exactly?

mikeash · on Dec 16, 2013

I know how git works in general. I wasn't 100% clear on the whole garbage collection aspect of it, which is hardly a central feature.

There's a difference between "has no idea how it works" and "understands the overall structure but doesn't know every single detail".

mtdewcmu · on Dec 16, 2013

It might be a better idea to just change the password.

pyre · on Dec 16, 2013

There is a point past which you can be too pedantic and be confused with a troll. I think that you may be straddling that line.

Next time, rather than just assume that the poster isn't smart enough to realize that a compromised password should be changed, maybe you could take in the fact that it's probably just an example of data that you might want to extract from your history if it's automatically there. I can think of numerous scenarios where someone might want to remove a password from the history even if it's not compromised (e.g. want to publish a private repo).

mtdewcmu · on Dec 16, 2013

I meant to raise the question of whether it's worth trying to expunge something sensitive from git. You'd have to track down all the clones of that repository. Even if you thought it was gone, it would be prudent to change the password anyway.

pseut · on Dec 16, 2013

s/password/large binary/