Code bases are as diverse, unique, and interesting as the people who work on them. But almost all of them have this in common: they grow over time (the code bases, not the people). Teams expand, requirements grow, and time, of course, marches on, and so we end up with more developers writing more code to do more things. And while we’ve all experienced the joy of deleting large chunks of code, that rarely offsets the overall expansion of our code bases.
If you are responsible for the architecture of your organization’s code base, then at some point you have to make some firm choices about how to manage this growth in a scalable way. There are two popular architectural alternatives to choose from.
One is the “multi-repo” architecture, where we break the code base into an increasing number of small repos, along sub-team or project boundaries. The other is the “monorepo”, where we maintain one large and growing repository containing code for many projects and libraries, with multiple teams collaborating across it.
The multi-repo approach may be tempting at first, as it seems easy to implement. We’re just creating more repositories as we need them! At first, it doesn’t seem like we need any special tools, and we can give individual teams more autonomy in how they manage their code.
Unfortunately, in practice, multi-repo architectures often result in a fragile, inconsistent, and change-resistant code base. This in turn can encourage the formation of silos in the engineering organization itself. In contrast, and perhaps unexpectedly, the monorepo approach is often a better, more flexible, more collaborative, and more scalable solution in the long run.
Why is this the case? Keep in mind that the hard problem in code base architecture is managing changes in the presence of dependencies, and dependencies in the presence of changes. In a multi-repo architecture, repos consume code from other repos via versioned, published artifacts, which makes change propagation much harder.
Specifically, what happens when we, the owners of repo A, need a change in repo B, which we consume? First, we must find the gatekeepers of repo B and convince them to accept the change and publish it in a new version. Then, in a perfect world, someone would find all the other consumers of repo B, upgrade them to that new version, and republish them. And then we would have to find the consumers of those initial consumers, upgrade and republish them against the new version, and so on, ad nauseam.
But who is the “someone” who will do all this work? How will they identify all these consumers? After all, dependency metadata lives on the consumer, not on the consumed repo, and there is no easy way to trace dependencies backwards. When ownership of a problem is not clear and its solution is not obvious, it is ignored, and so none of this effort actually happens in practice.
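To make the backtracking problem concrete, here is a minimal sketch of what that "someone" would have to compute. It assumes, optimistically, that every repo's dependency manifest has already been collected in one place; the repo names and the `MANIFESTS` mapping are hypothetical and purely illustrative.

```python
from collections import defaultdict, deque

# Hypothetical manifests: each repo records only what it consumes.
# Nothing in repo_b says who depends on it.
MANIFESTS = {
    "repo_a": ["repo_b"],
    "repo_c": ["repo_b"],
    "repo_d": ["repo_a", "repo_c"],
    "repo_b": [],
}


def invert(manifests):
    """Build the reverse index: consumed repo -> set of direct consumers."""
    consumers = defaultdict(set)
    for consumer, deps in manifests.items():
        for dep in deps:
            consumers[dep].add(consumer)
    return consumers


def transitive_consumers(manifests, changed):
    """Return every repo that directly or indirectly consumes `changed`."""
    consumers = invert(manifests)
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


if __name__ == "__main__":
    # Every repo in this set would need an upgrade and a republish.
    print(sorted(transitive_consumers(MANIFESTS, "repo_b")))
```

The traversal itself is trivial. The hard part in a multi-repo setting is the assumption baked into `MANIFESTS`: that someone can see all the manifests at once.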
And that might be fine, at least for a while, because the other repos are (hopefully!) pinned to the earlier version of the dependency. But this convenience is short-lived, because sooner or later several of these consumers will be combined into a single deployable artifact, and at that point someone will have to choose one version of the dependency for that deployable. So we end up with a transitive version conflict, created by one team in the past and planted in the code base like a time bomb, which explodes just as another team needs to integrate the code into production.
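As a rough illustration of when the time bomb goes off, the sketch below checks the pins of the components being combined into one deployable and reports any dependency that is requested at more than one version. The component names, the dependency names, and the version strings are all hypothetical.

```python
from collections import defaultdict


def find_version_conflicts(pins_by_consumer):
    """Return {dependency: {version: [consumers]}} for every dependency
    that the combined consumers pin to more than one version."""
    requested = defaultdict(lambda: defaultdict(list))
    for consumer, pins in pins_by_consumer.items():
        for dep, version in pins.items():
            requested[dep][version].append(consumer)
    return {
        dep: {version: sorted(users) for version, users in versions.items()}
        for dep, versions in requested.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    # Hypothetical pins: "billing" was never upgraded after "common" 2.0.0
    # was published, and nobody noticed until both landed in one artifact.
    pins = {
        "billing": {"common": "1.4.0", "http": "3.1.0"},
        "search": {"common": "2.0.0", "http": "3.1.0"},
    }
    print(find_version_conflicts(pins))
```

Detecting the conflict is the easy part. Resolving it means upgrading `billing` at integration time, which is exactly the work that was skipped when the new version was published.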
If this problem sounds familiar, it’s because it’s an in-house version of the infamous “dependency hell” problem that commonly afflicts a code base’s external dependencies. In a multi-repo architecture, first-party dependencies are, technically, treated like third-party dependencies, even though they are written and owned by the same organization. So with a multi-repo architecture, we’re essentially choosing to take on a massively expanded version of dependency hell.
Compare all this with a monorepo: all consumers live in the same source tree, so finding them can be as simple as using grep. Since there is no publishing step, and all code shares a single version (represented by the current commit), updating consumers transitively and in lockstep is procedurally straightforward. If we have good test coverage, we have a straightforward way of knowing when we’ve done it right.
Now, of course, “straightforward” doesn’t mean “easy”: upgrading the repo in lockstep may not be an easy effort in itself. But that’s just the nature of code changes. No code base architecture can remove the irreducible part of an engineering problem. But a monorepo at least forces us to deal with the necessary difficulty now, without creating unnecessary difficulty later.
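As an illustration of how little machinery this takes, here is a grep-like sketch that lists the files in a source tree that import a given module. It assumes a Python code base and a hypothetical module name, and it only recognizes simple `import x` and `from x import ...` lines, so treat it as a starting point rather than a complete dependency analysis.

```python
import re
from pathlib import Path


def find_consumers(root, module):
    """Return the relative paths of .py files under `root` that import `module`.

    Only simple `import module...` and `from module... import ...` lines are
    recognized; forms such as `import os, module` are missed.
    """
    pattern = re.compile(
        rf"^\s*(?:from|import)\s+{re.escape(module)}(?:[.\s]|$)", re.MULTILINE
    )
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if pattern.search(text):
            hits.append(path.relative_to(root).as_posix())
    return hits


if __name__ == "__main__":
    # "common" is a hypothetical first-party library living in the same tree.
    for consumer in find_consumers(".", "common"):
        print(consumer)
```

Because every consumer is in the tree at the current commit, the result is complete for the import forms it recognizes, with no manifests to collect and no published versions to reconcile.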
The tendency of the multi-repo architecture to push dependency hell onto others in the future is a manifestation of a broader problem related to Conway’s Law: “Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.” The converse is also true: the structure of communication in your organization tends to mimic the architecture around which that communication occurs. In this case, a fragmented code base architecture can lead to the balkanization of the engineering organization itself. The code base design ends up incentivizing gatekeeping and the abdication of responsibility for jointly achieving common goals, because those common goals are not represented architecturally. A monorepo gently supports and enforces organizational unity: everyone collaborates on a single code base, and the lines of communication that this imposes are exactly those that our organization needs in order to succeed in building a unified product.
A monorepo is not a panacea. It requires appropriate tools and processes to maintain engineering performance and efficiency at scale. But with the right architecture and the right tools, you can maintain your unified code base, and your unified organization, at scale.