Distributed and Normal Version Control Systems

Version Control

Distributed version control systems (DVCS) have been around for quite a while, but it was only when I came across Joel’s post that I thought I ought to investigate.

I had never looked at any DVCS before, but I have to admit, the article had the desired effect, and I finally bothered to download a DVCS tool, in this case Mercurial. Since Mercurial is written in Python it is nicely portable across platforms, and seems to present the same feature set on the platforms I tried (Windows and Linux), including the built-in web server (with nice branching graphs).

So what is the fuss all about? If you are used to ‘normal’ version control systems such as Subversion, then the most interesting difference is best explained in VCS terms: when you ‘checkout’ from the (main DVCS) repository you are always checking out a new branch. That’s it really.

In a normal VCS you would check out a working copy from the repository, and your commits would go back onto the trunk/head of that repository, unless you created a branch and checked that out instead.

In a DVCS you always clone the entire repository onto your machine, and your commits go against your own clone. In effect, this is forced branching for the old VCS crowd. Because every clone is a branch, merging is a core feature of a DVCS, and it tends to be faster and more reliable than in a normal VCS (especially the older ones).
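
As a rough illustration of the difference, here is a toy Python model. It is not how Mercurial or Git actually store history, and the `Repo` class with its `clone`, `commit` and `push` methods are names invented for this sketch. The only point is that commits land in your own clone, and the main repository is untouched until you push.

```python
class Repo:
    """Toy repository: history is just an ordered list of commit messages."""

    def __init__(self, history=None):
        self.history = list(history or [])

    def clone(self):
        # A clone carries the entire history, not just the latest files.
        return Repo(self.history)

    def commit(self, message):
        # Commits only ever touch this repository.
        self.history.append(message)

    def push(self, remote):
        # Simplification: only allow a push when the remote has not moved on,
        # i.e. its history is a prefix of ours. Real tools would ask for a merge.
        if self.history[:len(remote.history)] != remote.history:
            raise ValueError("remote has new commits; pull and merge first")
        remote.history = list(self.history)


main = Repo(["initial import"])
mine = main.clone()
mine.commit("experiment")      # lands in my clone only
print(main.history)            # main is unchanged until I push
mine.push(main)
print(main.history)
```

In a centralised VCS the `commit` step would go straight to the shared repository; here it needs the separate `push`.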

The DVCS approach is great for most projects, e.g. Linux (which famously uses Git). However, in large businesses control is key, and they still like to have a centrally hosted and administered repository, where all manner of control is applied to all manner of things (commit, branch, tagging rights, browsing etc.). In such a repository, branches are (generally) visible to anybody with access to it.

In a DVCS, any clones that you make (and branches off those clones) are entirely private to your local machine, and do not pollute the main/master repository with dead-end branches.

One minor drawback of DVCS is that until you push your changes back to the remote repository, they exist only on your machine. This is a drawback in the sense that in some businesses you have:

  • high turnover of staff
    • a developer doesn’t push back to the remote repository before they leave, and their work is lost
  • or, and it does happen, unstable developer machines (e.g. users allowed to install software, old hardware that fails etc.)
    • the machine dies and the work is lost

These are, frankly, minor issues.  If you push and pull frequently enough, you can always get your repository back from another developer.
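
To make the risk concrete, the sketch below lists the commits that exist only on one machine. The function name `commits_at_risk` and the commit identifiers are made up for illustration; on a real Mercurial repository, `hg outgoing` gives the same kind of answer.

```python
def commits_at_risk(local_history, remote_history):
    """Return the local commits that the remote repository does not hold."""
    already_pushed = set(remote_history)
    return [c for c in local_history if c not in already_pushed]


remote = ["c1", "c2"]
local = ["c1", "c2", "c3", "c4"]   # two commits not yet pushed

print(commits_at_risk(local, remote))   # what is lost if this machine dies now
remote = list(local)                    # simulate a push
print(commits_at_risk(local, remote))   # nothing left at risk
```

The more often you push, the shorter that first list stays.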

One other minor issue with DVCS is that you are always cloning an entire repository. For large repositories you are going to start getting performance issues based on, in no particular order: bandwidth, main repository speed, and local disc performance. This is equally true of checking out a large VCS repository, although there you can do sparse checkouts, which can help.

The general lesson to learn here for any type of VCS is to be quite ruthless with your repository management: is it too large, are there too many projects in it (KDE is a classic example)? If you use a CI tool as part of your build process (and you should), you are probably aware that the CI tool will often check out your entire repository (depending on setup), and therefore the repository size becomes a time issue for rebuilds. For DVCS the overall main repository size (not the checked-out repository) is likely to be considerably smaller than a standard VCS repository, as it will have had fewer commits made to it (and a higher percentage of useful ones).

DVCS have the benefit of being built from scratch, based on several decades of experience with previous version control systems and where they went wrong. You can commit whenever you like, without having to worry about meeting ‘commit requirements’ such as test suites passing (until you push to the master repository). The performance is generally much better than traditional VCS. As mentioned by Linus Torvalds here (if you get the chance, watch the full one hour talk, but be objective about what is said), KDE imported into Git takes about 1.3GB, whereas in Subversion it takes over 4GB. In other words, this DVCS stores data efficiently.

There seems to be a slight amount of mania regarding DVCS. If you use an existing VCS and it works for you, don’t worry! If you’re starting a new project, consider DVCS. If you’re starting a new project that is cross-platform, consider Mercurial first.

Update: 05 June 2010: Git also produces the same graphs that Mercurial does via ‘hg serve’, but from the command line. See here for details.