How many Git repos?
I’ve been pondering how to split (or not, as the case may be) code between separate Git repositories. Say we’re developing a web application that consists of three modules: a SQL database schema, a Haskell web server, and a web client built using Node. We could put this code into repositories in different ways; for example, one repository per module, or a single repository for all three.
The first alternative – one repository per module – seemed compelling at first. This encourages clean separation, lends itself to using Git tags for versions, and allows us to grant access to code on a need-to-know basis.
For over two years, that was what we did in our startup, but it wasn’t as clean and convenient as I had first imagined.
Following the one-repository-per-module policy, we created a new repository when we broke out one part of the project into its own library. For example, we put our email-handling code into its own module and therefore also in its own repository. The number of repositories we needed to build the product steadily grew – in our case to over 15.
Managing the code in separate repos was inconvenient: pulling changes from all repos in case something changed, creating versions and specifying dependencies between modules, adding the same branch to many repos when changes touch more than one repo, and the inability to make atomic API changes that span more than one module. I think Jack O’Conner expressed it nicely:
So the question isn’t “One big repo or many small repos?” It’s actually “One big repo or many small repos with tooling.”
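To make “tooling” concrete, here is a minimal sketch of the kind of helper a many-repo setup tends to need just to stay up to date. It is not what we used. It assumes a hypothetical workspace directory that holds one clone per module, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def find_repos(root: Path) -> list[Path]:
    """Return the immediate children of root that contain a .git entry."""
    return sorted(p for p in root.iterdir() if (p / ".git").exists())


def pull_command(repo: Path) -> list[str]:
    """Build a fast-forward-only pull command for one repository."""
    return ["git", "-C", str(repo), "pull", "--ff-only"]


def pull_all(root: Path) -> dict[str, int]:
    """Pull every repository under root and record each exit code."""
    results = {}
    for repo in find_repos(root):
        # A non-zero exit code means the pull failed, e.g. because the
        # local branch has diverged and cannot be fast-forwarded.
        results[repo.name] = subprocess.run(pull_command(repo)).returncode
    return results
```

Calling `pull_all(Path("workspace"))` would pull each clone in turn. This covers only the first item on the list above; versioning, cross-repo branches and atomic changes would each need more tooling of their own.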
Since Git is a powerful tool, I now think that project [1] code and related artifacts should be kept in a single large [2] repository (except for open-sourced modules [3]). We only have a single repository to pull and push, we can make atomic commits when changes span many modules, we can include deployment and development environment code, we have a natural entry point for new developers, we have a combined view of what has changed in the project, we don’t have to work across several repositories to pinpoint where bugs were introduced, and we can modularise and re-organise with minimal overhead.
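As a rough illustration of the “combined view” point, the sketch below lists what changed between two commits of a single repository and groups the files by module. It assumes each module lives in its own top-level directory (for example hypothetical `schema/`, `server/` and `client/` directories); the function names are also made up.

```python
import subprocess
from collections import defaultdict


def changed_paths(old: str, new: str) -> list[str]:
    """List files that differ between two commits, via `git diff --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", old, new],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()


def group_by_module(paths: list[str]) -> dict[str, list[str]]:
    """Group repository-relative paths by their top-level directory."""
    groups = defaultdict(list)
    for path in paths:
        top, sep, _rest = path.partition("/")
        # Files at the repository root do not belong to any module directory.
        groups[top if sep else "(root)"].append(path)
    return dict(groups)
```

Run from inside the repository, `group_by_module(changed_paths("v1.0", "HEAD"))` would show in one step which modules a release touched, where `v1.0` stands in for any tag or commit. With one repository per module, the same question means running the diff in every clone and combining the results by hand.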
Other people seem to have come to similar conclusions about grouping modules in a single repository: see “On monolithic repositories” and “Choosing between Single or multiple projects in a git repository?”. Note that the latter post uses “project” to refer to what I’ve called “module”.
I have yet to find a strong argument for splitting project code into many repositories, except if Git struggles with the repository size or if parts of the code are highly sensitive and require granular access control. I’d love to hear about any argument you’ve come across for having a granular repository split, though.
[1] By “project” I refer to the effort to develop a set of interrelated functionality. For example, Google’s search engine would be one project (and include AdWords). YouTube and Android would each be a separate project. Thus, a project can comprise several “customer products” and is determined more by how interrelated the functionality is and how the project is delivered and deployed.
[2] If we use one large repository, kinks in our Git workflow will be amplified by the higher number of commits and contributors. I find Sandofsky’s proposal in “Understanding the Git Workflow” compelling and I think it would be suitable for most small-to-medium-sized projects.
[3] Open-sourced modules are public, in contrast to proprietary code, and they are usually made available in public package repositories (NPM, Hackage, etc.). Publishing there solves some of the dependency problems, but slows down iteration cycles and increases overhead somewhat.