Today we’re going to cover version control with Git from the perspective of someone coming from a second-gen source control tool like CVS, SVN, or TFS. Depending on how familiar you are with source control, you may want to skip around in this article. We’re going to cover the following:
- What is a VCS?
- Version Control history
- What makes Git different
- Git Flow
- Basics of Git
- Useful Links
So without further ado, let’s begin as I take a biiiig step back and start with…
What is a “Version Control System (VCS)” and why do we need one?
A version control system is simply a tool that helps you manage changes to documents, and it’s most often used for working with source code. While you can definitely build software without using a VCS, anything aside from a small hobby project will likely benefit from some form of version control. A good VCS acts as a backup in case you ever need to revert changes and go back to an earlier copy of your work, or in case you lose your work entirely. It also helps with collaboration: when you are developing something with others you’re going to need to share your work, and doing that over e-mail or a shared drive is just a complete mess. It’s also not just for software! Many people use a VCS, even Git specifically, for things like writing books or documentation or even doing graphics work. People have even used Git as a distributed database that keeps track of purchase orders, but that’s getting away from what we’re going to discuss.
The History of VCS
Version control can be thought of as three generations of tools: first gen, second gen, and third gen. A first gen tool would be something like “RCS” (Revision Control System). I’ll spare you the details, but RCS was made for *NIX systems, worked with a single file at a time, used locking to handle concurrency, and didn’t support any kind of networking. The second gen blew the first out of the water; these would be CVS, SVN, TFS and the like. Second gen VCSs handled multiple files and introduced the idea of merging instead of locking: you merge your local work in with the master copy before committing it. These systems support networking and are built around a central server that keeps track of all the changes and gives developers a central repository to check their code in to and get the latest code from. Then the third gen of VCSs came out; these would be tools like Mercurial and Git. Unlike the prior generation, Git is distributed and doesn’t need to have one definitive remote master server, and for concurrency it uses the “commit before merging” philosophy, which is the polar opposite of how things work in the second gen.
Talking about Git specifically, it was created in 2005 by Linus Torvalds (he’s kind of a big deal). It started out as a small set of low-level C programs glued together with shell scripts, which he wrote after the Linux kernel project lost its free use of another, proprietary VCS called BitKeeper. A funny point here is that Linus claimed to make Git with the following philosophy in mind: “Take CVS as an example of what not to do. If in doubt, make the exact opposite decision.”
How’s Git different from the second gen VCSs?
While there’s almost an endless supply of differences, let’s look at the major ones:
Branching and Merging
In TFS, making a branch was a time-and-space-consuming operation meant mainly for some great deviation of the project. The process would eat up tonnes of resources, and moving or merging between branches was a great feat. In Git, branching takes almost no time at all and uses practically no additional space. Creating a branch is encouraged for even the smallest things. Making a small style update? Create a branch. Updating unit tests? Create a branch. Trying something out for fun? Create a branch. In a big project your branching diagram would look something like this:
As you can see above, you have your various user stories each in their own branch, sometimes merged between each other, but ultimately everything comes from and goes back into a master branch.
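To make the “branching is cheap” point a bit more concrete, here is a minimal sketch in a throwaway repository. The temporary directory, the file notes.txt, and the branch name style-update are all made up for illustration; the point is only that a new branch is a new name for a commit you already have.

```shell
# Throwaway repository; every name below is made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "hello" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"

# Creating a branch only records a new name for the current commit;
# no files are copied.
git branch style-update

# Both names resolve to the same commit id.
git rev-parse HEAD
git rev-parse style-update
```

The two ids printed at the end should be identical, which is why creating the branch is practically free.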
Unlike SVN, TFS, and CVS, there doesn’t need to be a central remote repo that tracks everything. You can use git entirely on your local machine with no remote server at all. Every machine with a git project has the entire history and contents of the whole project, and the user optionally keeps it up to date with one or more remote servers by pushing or pulling changes. Consider the following possible scenario where every developer has their own remote repository that an integration manager pulls from, integrates the code, then pushes it to a master ‘blessed’ repository:
Also, here’s a diagram that explains how remotes and local repositories work, and how data moves between them:
So you add/move/remove and commit files just as in SVN/TFS/CVS, but you use push and fetch to go between the remote server and your own local repository. A pull is simply a fetch followed by a merge. This is more understandable when you consider that the remote branch and your own local version of that branch are logically two different branches with the same ancestors, not one branch that happens to be copied in two places. That is, your local ‘develop’ can easily be, and often is, in a different state than your remote ‘develop’ (known in git as origin/develop; ‘origin’ is the usual name assigned to your main remote repository), so you need to merge between that remote develop and your own. Unlike merges from a story branch into develop, those merges will almost never have conflicts unless you’re committing directly to develop, which you should probably not be doing in the first place: as mentioned above, branching is inexpensive and a good way to keep things organized.
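Here is a rough sketch of the “local develop and origin/develop are two different branches” idea. It fakes a remote with a local bare repository and uses two hypothetical developers, Alice and Bob; the paths and the file app.txt are assumptions made up for the example.

```shell
# Throwaway setup: a bare "remote" plus two working repositories.
# All paths and names are made up for illustration.
work=$(mktemp -d)
git init -q --bare "$work/origin.git"

# Alice creates develop and publishes the first commit.
git init -q "$work/alice"
cd "$work/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
git symbolic-ref HEAD refs/heads/develop
git remote add origin "$work/origin.git"
echo "v1" > app.txt
git add app.txt
git commit -q -m "First commit"
git push -q origin develop

# Bob clones, then Alice publishes a second commit.
git clone -q -b develop "$work/origin.git" "$work/bob"
echo "v2" > app.txt
git commit -q -am "Second commit"
git push -q origin develop

# Bob fetches: origin/develop moves, his local develop does not.
cd "$work/bob"
git fetch -q origin
before=$(git rev-parse develop)
remote=$(git rev-parse origin/develop)

# Merging origin/develop brings local develop up to date.
# "git pull" would have done the fetch and the merge in one step.
git merge -q origin/develop
after=$(git rev-parse develop)
echo "before=$before remote=$remote after=$after"
```

After the fetch, before and remote differ; after the merge, after matches remote.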
Staging Area, Working Copy
In SVN, when you make a change and commit it, that change gets sent up to the remote SVN server that keeps track of all the changes. In Git, committing a change (like almost everything else) only happens locally. Remember I mentioned that every person/machine that has a git project has the entire thing? Committing simply updates that local repository. In addition to the local repository you commit to, there’s also something called the “staging area”: it holds changes that have not yet been committed but are going to be in the next commit. This is in contrast to your “working copy”, which is the current state of all your local files as they stand at the moment. To elaborate, let’s say I have a project with a config file that is tracked in the VCS. If I update something in the config file to make the project work locally for me, and then I muck around with making actual changes to the application I’m working on, I don’t want to commit the changes I’ve made to the config; I only want to commit my changes to the app. Git makes this easy: you simply stage all the changes you made with the exception of the config file, then commit that. The config file will stay as is in the working copy, but the other changes will be committed (locally!) as intended.
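The config file scenario above might look something like this on the command line. The file names config.ini and app.py and their contents are invented for the example.

```shell
# Throwaway repository; file names and contents are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "debug=false" > config.ini
echo "print('v1')" > app.py
git add config.ini app.py
git commit -q -m "Initial commit"

# A local-only tweak to the config, plus a real change to the app.
echo "debug=true" > config.ini
echo "print('v2')" > app.py

# Stage only the app change; config.ini stays out of the next commit.
git add app.py
git commit -q -m "Update app output"

# config.ini still shows up as modified in the working copy.
git status --short
```

The final status should list only config.ini as modified, since the app change has already been committed.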
No Version Numbers
If the title of this heading threw you off and has you asking “how can you have no version numbers in a version control system?”, consider what I said before: every person/machine has their own entire copy of the repo, and any commits made on those machines are only applied locally. You can make multiple commits and branches, and eventually you’ll merge them together somewhere (probably in some remote repository). How would you ever have a chronological order of version numbers if you’re working in isolation, offline, most of the time? There’d be no way to know who’s got version 1 and who’s got version 2. So how does Git tackle this problem? Each commit is instead given a commit id, which is a hash computed from the snapshot of every tracked file at that commit together with the commit’s metadata: its parent commit(s), the author and committer, the timestamps, and the message. This means that even two commits with identical files, made at different times or with different commit messages, still get different IDs.
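If you want to see a commit id and what goes into it, something like the following sketch works in a scratch repository. The file name and commit messages are made up, and the amend step is just one convenient way to get two commits with the same files but different messages.

```shell
# Throwaway repository; names and messages are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "hello" > readme.txt
git add readme.txt
git commit -q -m "Initial commit"

# The commit id is a hash (40 hexadecimal characters with the
# SHA-1 object format).
first_id=$(git rev-parse HEAD)
echo "$first_id"

# The commit object it names: a tree id (the snapshot), author and
# committer lines with timestamps, and the commit message.
git cat-file -p HEAD

# Same files, different message: the snapshot (tree) id is unchanged,
# but the commit id is not.
git commit -q --amend -m "Reworded message"
second_id=$(git rev-parse HEAD)
echo "$second_id"
```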
This may seem confusing and backwards, but you get used to it pretty quickly and it just becomes part of your natural work process.
Snapshots, not differences
Lastly, remember when I said that branching in Git is very inexpensive both time- and space-wise? That’s because, conceptually, Git doesn’t store file differences; i.e. instead of incrementally recording what changed in each file in each commit, each commit refers to a full snapshot of every file. This is what allows Git to be so efficient at branching. You might expect it to make git repositories take up a lot more space, but a file that didn’t change between commits is stored only once, and Git compresses what it stores. On the same note, git branches take up practically no space of their own, because a branch only adds the files that were modified on it. That is, if I have a branch called ‘develop’ and a branch called ‘story’ which was created based off develop, and I update one file in ‘story’ and commit it, the only extra space taken up by ‘story’ will be that one file; the rest are the same as in develop, so git has no need to duplicate them. Eventually story gets merged into develop, so develop and story are now using the same files, which means story ‘goes back to’ not using any extra space.
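One way to convince yourself that unchanged files aren’t duplicated is to compare the ids Git assigns to a file’s contents in two different commits. This is a small sketch with made-up file names (a.txt, b.txt) in a scratch repository.

```shell
# Throwaway repository; file names are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "unchanged" > a.txt
echo "v1" > b.txt
git add a.txt b.txt
git commit -q -m "First snapshot"

echo "v2" > b.txt
git commit -q -am "Second snapshot"

# a.txt was not touched, so both commits point at the same stored object.
git rev-parse HEAD~1:a.txt HEAD:a.txt

# b.txt changed, so the second commit points at a new object.
git rev-parse HEAD~1:b.txt HEAD:b.txt
```

The first pair of ids should match and the second pair should differ.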
There’s a really good quote on the official Git website which states “This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.” This is a good way of thinking about Git if you’re familiar with filesystems.
Git Flow
Git Flow is one of the many possible branching strategies for git. For small projects it may make sense to simply have a master branch that everyone pulls from and pushes to, with every developer working in their own isolated branch the rest of the time. This doesn’t really work for huge projects because you may have a user story with two or more developers on it, who both need to make changes that work together but at the same time they don’t want to break what other people are working on in other user stories. Git Flow is a guideline on how to manage your branches and it looks something like this:
To give you a basic explanation: you have a master branch that contains your latest stable product. The develop branch contains the latest and greatest code that all the developers have worked on and already tested. Developers split up into their feature branches (usually one per user story or defect) based off develop and do their work there, merging it back to develop when their stuff works. That part of the process would probably include some sort of peer review (possibly using pull requests). The lead developer, or someone who’s overseeing the project from a technical side, would take care of merging from develop into master to keep master up to date. Occasionally, a major problem in the production version of the application may require a hot fix, made in a branch off of master and merged back into master (and, in the original Git Flow write-up, into develop as well so the fix isn’t lost).
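As a rough sketch of that flow using plain git commands (no Git Flow helper tooling assumed), here is one feature and one hot fix in a scratch repository. The branch names feature/login and hotfix/typo and the files are hypothetical, and merging the hot fix into develop follows the original Git Flow write-up rather than anything required by git itself.

```shell
# Throwaway repository; branch and file names are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"
git symbolic-ref HEAD refs/heads/master

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial release"
git branch develop

# Feature work happens on a branch created from develop.
git checkout -q -b feature/login develop
echo "login" > login.txt
git add login.txt
git commit -q -m "Add login"

# When the feature works, merge it back into develop.
git checkout -q develop
git merge -q --no-ff -m "Merge feature/login into develop" feature/login

# A hot fix branches off master and is merged back into master...
git checkout -q -b hotfix/typo master
echo "v1.0.1" > app.txt
git commit -q -am "Fix typo"
git checkout -q master
git merge -q --no-ff -m "Merge hotfix/typo into master" hotfix/typo

# ...and into develop, so the fix is not lost in the next release.
git checkout -q develop
git merge -q --no-ff -m "Merge hotfix/typo into develop" hotfix/typo
```

At the end, master has the hot fix but not the unreleased login feature, while develop has both.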
There’s a great write-up of git flow here.
There are many clients you can use to ‘use git’ and I’ll run through them briefly. It ultimately is a matter of personal preference since they all do the same thing.
Git Command Line
Officially, Git is a command line tool and you can use all of its features right from the CLI if you’re into that. I am no stranger to the command line but I personally prefer to use a GUI for it because it makes it easier to visualize changes. If you do decide to go with the CLI version, you can get it here: http://git-scm.com/
Git Built-in GUI (gitk)
If you call up “gitk” from the command line it will launch a very ugly looking GUI for Git that comes bundled. Frankly, if you’re going to use a GUI you might as well get one that looks nice but I guess gitk does what it’s supposed to.
This is what I personally use. It is free, functional, and although it is a bit bloated from being made with Java it is quite good. It is also cross platform which is a big plus for some projects. Get it here.
Github for Windows
Haven’t used it extensively but it looks very sleek and simple, though as a result it also looks like it’s missing a lot of features.
Visual Studio Integration
Once again, haven’t used this extensively but I have heard from numerous colleagues that frankly it sucks and should be avoided at all costs due to instability and lack of features.
The basics of using Git
Alright, so how does one actually get started with Git? Well, if you’re looking for an in-depth walkthrough with examples, I’d recommend checking out Roger Dudler’s Git – The Simple Guide. However, if you’re comfortable with just a general high-level overview, then here’s what you will be doing when you are using Git:
Clone or initialize your repository
If you’re starting a new project, you will need to initialize a repo using “git init” and if you already have some remote repository then you’ll need to clone from that repository using “git clone” – you’re now going to be on your repository’s default branch (probably either ‘master’ or ‘develop’). From here you can already make changes to your project in your IDE or text editor but as I mentioned earlier, we should really create a branch to work off of so if we commit and push any bad code (which I’m sure would never happen) then it won’t affect the main branch that others are working with.
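Both starting points can be sketched like this. The directory names are made up, and a local bare repository stands in for the remote URL you would normally hand to “git clone”.

```shell
# Scratch area; all directory names are made up.
work=$(mktemp -d)

# New project: turn a directory into a repository.
git init -q "$work/new-project"

# Existing project: clone it. A local bare repository stands in for
# the remote URL you would normally pass to "git clone". Git warns
# that the repository is empty, which is fine for this sketch.
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/existing-project"

# The clone remembers where it came from under the name "origin".
cd "$work/existing-project"
git remote -v
```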
Create a branch to do your work
Either “git checkout” an existing branch, or create a new one and switch to it with “git checkout -b” (“git branch” on its own creates the branch but leaves you on the current one). Make your changes to the application, adding or removing any files as necessary, and stage them using “git add”. Using “git add” puts changes from the working copy into the staging area.
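In command form, this step might look like the sketch below. The branch name feature/search and the file search.txt are hypothetical.

```shell
# Throwaway repository with one commit; names are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Create a new branch and switch to it in one step.
git checkout -q -b feature/search

# Do some work, then stage it.
echo "search code" > search.txt
git add search.txt

# The new file is staged (status "A") but not committed yet.
git status --short
```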
Commit your work
When you’re ready to commit, you can double check what changes are going to be made using “git status”, then commit using “git commit”. The changes are now committed locally and you are one commit ahead of the remote branch. To move the commit to the remote repository, you do a “git push” to origin. Pushing your feature branch is optional, and frankly entirely unneeded if nobody else is working on the same feature as you, because in this case you would only be pushing your local non-develop branch to the remote server.
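Here is a small sketch of the commit-then-push cycle, including a way to see the “one commit ahead” state. A local bare repository plays the part of origin, and the branch feature/report and file report.txt are made up.

```shell
# Scratch setup: a bare "origin" and one working repository.
# All names are made up for illustration.
work=$(mktemp -d)
git init -q --bare "$work/origin.git"
git init -q "$work/dev"
cd "$work/dev"
git config user.name "Example Dev"
git config user.email "dev@example.com"
git remote add origin "$work/origin.git"
git symbolic-ref HEAD refs/heads/feature/report

echo "draft" > report.txt
git add report.txt
git commit -q -m "Start report"
git push -q -u origin feature/report

# Make a change, stage it, review it, commit it.
echo "final" > report.txt
git add report.txt
git status --short
git commit -q -m "Finish report"

# The local branch is now ahead of its remote counterpart.
ahead=$(git rev-list --count origin/feature/report..feature/report)
echo "ahead by $ahead"

# Pushing publishes the commit and closes the gap.
git push -q origin feature/report
after_push=$(git rev-list --count origin/feature/report..feature/report)
echo "ahead by $after_push"
```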
Merge your work
Once your feature is done, you want to get it from your feature branch back into develop. You switch branches to develop and use “git merge” to merge the changes from the feature branch into your current local develop branch. Once the merge is done, use “git push” again and push the changes to the origin remote server.
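The merge step could be sketched as follows, again with a local bare repository standing in for origin and a made-up feature branch called feature/export. Deleting the feature branch at the end is an optional bit of housekeeping, not something the steps above require.

```shell
# Scratch setup: a bare "origin" and one working repository.
# All names are made up for illustration.
work=$(mktemp -d)
git init -q --bare "$work/origin.git"
git init -q "$work/dev"
cd "$work/dev"
git config user.name "Example Dev"
git config user.email "dev@example.com"
git remote add origin "$work/origin.git"
git symbolic-ref HEAD refs/heads/develop
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
git push -q origin develop

# Finish some work on a feature branch.
git checkout -q -b feature/export
echo "export" > export.txt
git add export.txt
git commit -q -m "Add export"

# Switch to develop, merge the feature in, and publish the result.
git checkout -q develop
git merge -q feature/export
git push -q origin develop

# Optional: the feature branch is no longer needed.
git branch -q -d feature/export
```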
Further Reading & Useful Links
90% of the time the above steps in the basics are all you’re going to be doing when you’re working with Git. In addition to that, it is worth looking into how to reset or revert commits, how to stash changes, and how rebasing works.
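To give a small taste of two of those, here is a hedged sketch of reverting a commit and stashing unfinished work in a scratch repository (rebasing and reset are left for the further reading). The file name and contents are made up.

```shell
# Throwaway repository; names and contents are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
echo "v2" > app.txt
git commit -q -am "Risky change"

# Revert: add a new commit that undoes the previous one.
git revert --no-edit HEAD > /dev/null
cat app.txt

# Stash: shelve uncommitted edits, leaving a clean working copy.
echo "half-finished" > app.txt
git stash > /dev/null
during=$(cat app.txt)

# Pop: bring the shelved edits back.
git stash pop > /dev/null
cat app.txt
```

Note that the revert does not erase history: the risky commit is still there, followed by a commit that undoes it.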
Here are some services where you can host your project (privately or publicly) using Git: