Opened 6 months ago

Last modified 3 weeks ago

#30770 new defect

consider alternatives to the puppet mono-repo

Reported by: anarcat Owned by: tpa
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: Actual Points:
Parent ID: #29387 Points:
Reviewer: Sponsor:

Description (last modified by anarcat)

another aspect of "how to publish our puppet repos" and how to collaborate is how to manage sub-repositories. expanding on the "mono-repo" problem discussed in #29387, i have found the following options:

  1. current "monorepo" approach
  2. pure librarian / r10k
  3. git submodules
  4. git subtree (originally from apenwarr but merged in mainline git since 1.7.11)
  5. git subrepo
  6. myrepos and its numerous alternatives
  7. Puppet itself, with the vcsrepo module or the puppet module command

i'll add more as i find them here. i should probably make a more detailed review of the advantages/disadvantages of all of those...

Child Tickets

Change History (7)

comment:1 Changed 6 months ago by anarcat

Description: modified (diff)

This talk is a good introduction to git submodule, librarian and r10k. Based on that talk and these slides, I've made the following observations:

monorepo

This is our current approach, which is that all code is committed in one monolithic repository. This effectively makes it impossible to share code outside of the repository with anyone else because there is private data inside, but also because it doesn't follow the standard role/profile/modules separation that makes collaboration possible at all. To work around that, I designed a workflow where we locally clone subrepos as needed, but this is clunky as it requires committing every change twice: once for the subrepo, once for the parent.

Our giant monorepo also mixes all changes together, which can be a pro *and* a con: on the one hand it's easy to see and audit all changes at once, but on the other hand, it can be overwhelming and confusing.

But it does allow us to integrate with librarian right now and is a good stopgap solution. A better solution would need to solve the "double-commit" problem and still allow us to have smaller repositories that we can collaborate on outside of our main tree.

submodules

The talk partially covers how git submodules work and how hard they are to deal with. I say partially because submodules are even harder to deal with than the examples she gives. She shows how submodules are hard to add and remove, because the metadata is stored in multiple locations (.gitmodules, .git/config, .git/modules/ and the submodule repository itself).

She also mentions submodules don't know about dependencies and it's likely you will break your setup if you forget one step. (See this post for more examples.)

In my experience, the biggest annoyance with submodules is the "double-commit" problem: you need to make commits in the submodule, then *redo* the commits in the parent repository to chase the head of that submodule. This does not improve on our current situation, which is that we need to do those two commits anyways in our giant monorepo.
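For illustration, here's a rough sketch of that double-commit dance using throwaway local repositories. The repository names (module, parent) and the modules/demo path are made up for the example and are not part of our setup:

```shell
#!/bin/sh
# Sketch of the submodule "double-commit" problem with throwaway repos.
# All names here (module, parent, modules/demo) are hypothetical.
set -eu
work=$(mktemp -d)

# identity used only for the demo commits
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# a stand-alone module repository with one commit
git init -q "$work/module"
echo 'class demo {}' > "$work/module/init.pp"
git -C "$work/module" add init.pp
git -C "$work/module" commit -q -m 'initial module commit'

# the parent repository tracks the module as a submodule
git init -q "$work/parent"
cd "$work/parent"
# recent git versions refuse local-path submodules unless the file
# transport is explicitly allowed
git -c protocol.file.allow=always submodule --quiet add "$work/module" modules/demo
git commit -q -m 'add demo module as a submodule'

# commit 1: the actual change, made inside the submodule
echo '# a change' >> modules/demo/init.pp
git -C modules/demo commit -q -a -m 'change the module'

# the parent now reports the submodule as changed
git status --short modules/demo

# commit 2: record the new submodule head in the parent
git add modules/demo
git commit -q -m 'bump modules/demo to its new head'
```

The second commit carries no content of its own: it only moves the pointer the parent keeps for modules/demo.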

One advantage of submodules is that they're mostly standard: everyone knows about them, even if they're not familiar with them, and that knowledge is reusable outside of Puppet.

librarian

Librarian is written in ruby. It's built on top of another library called librarian that is used by Ruby's bundler. At the time of the talk, it was "pretty active" but unfortunately, librarian now seems to be abandoned, so we might be forced to use r10k in the future, which has a quite different workflow.

One problem with librarian right now is that librarian update clears any existing git subrepo and re-clones it from scratch. If you have temporary branches that were not pushed remotely, all of those are lost forever. That's really bad and annoying! It's by design: it "takes over your modules directory", as she explains in the talk, and everything comes from the Puppetfile.

Librarian does resolve dependencies recursively and stores the decided versions in a lockfile, which allows us to "see" what happens when you update from a Puppetfile.

But there's no cryptographic chain of trust between the repository where the Puppetfile is and the modules that are checked out. Unless the module is checked out from git (which isn't the default), only version range specifiers constrain which code is checked out, which gives a huge surface area for arbitrary code injection in the entire puppet infrastructure (e.g. MITM, forge compromise, hostile upstream attacks).
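As a sketch of what a stronger check could look like for git-based modules: if the trusted parent repository records the full commit hash expected for each module, a small helper can refuse any checkout that doesn't match it. The verify_pin function below is hypothetical (neither librarian nor r10k ships it), and the demo repository is a throwaway:

```shell
#!/bin/sh
# verify_pin DIR SHA: succeed only if the checkout in DIR is exactly at
# commit SHA. Hypothetical helper, not part of librarian or r10k.
verify_pin() {
    actual=$(git -C "$1" rev-parse --verify HEAD) || return 2
    if [ "$actual" = "$2" ]; then
        echo "ok: $1 is at $2"
    else
        echo "MISMATCH: $1 is at $actual, expected $2" >&2
        return 1
    fi
}

# demo against a throwaway repository standing in for a module checkout
work=$(mktemp -d)
git init -q "$work/module"
git -C "$work/module" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'

# in real use this hash would come from the trusted parent repository
pinned=$(git -C "$work/module" rev-parse HEAD)
verify_pin "$work/module" "$pinned"
```

This only covers modules fetched over git and pinned to a full commit hash. A version range from the forge has nothing equivalent to compare against.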

r10k

r10k was written because librarian was too slow for large deployments. But it covers more than just managing code: it also manages environments and is designed to run on the Puppet master. It doesn't have dependency resolution or a Puppetfile.lock, however.

r10k is more complex and very opinionated: it requires lots of configuration including its own YAML file, hooks into the Puppetmaster and can take a while to deploy. r10k is still in active development and is supported by Puppetlabs, so there's official documentation in the Puppet documentation.

It is often used in conjunction with librarian for dependency resolution.

One cool feature is that r10k allows you to create dynamic environments based on branch names. All you need is a single repo with a Puppetfile and r10k handles the rest. The problem, of course, is that you need to trust it's going to do the right thing. There's the security issue, but there's also the problem of resolving dependencies and you *do* end up double-committing in the end if you use branches in sub-repositories. But maybe that is unavoidable.

git subtree

This article quickly mentions git subtrees from the point of view of Puppet management. It outlines how it's cool that the history of the subtree gets merged as is in the parent repo, which gives us the best of both worlds (individual, per-module history view along with a global view in the parent repo). It makes, however, rebasing in subtrees impossible, as it breaks the parent merge. You do end up with some of the disadvantages of the monorepo in that all the code is actually committed in the parent repo and you *do* have to commit twice as well.

subrepo

TODO. https://github.com/ingydotnet/git-subrepo

myrepos

myrepos is one of many solutions to manage multiple git repositories. It has been used in the past at my old workplace (Koumbit.org) to manage and check out multiple git repositories.

Like Puppetfile without locks, it doesn't enforce cryptographic integrity between the master repositories and the subrepositories: all it does is define remotes and their locations.

Like r10k it doesn't handle dependencies and will require extra setup, although it's much lighter than r10k.

Its main disadvantage is that it isn't well known and might seem esoteric to people. It also has weird failure modes, but could be used in parallel with a monorepo. For example, it might allow us to set up specific remotes in subdirectories of the monorepo automatically.
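To give an idea, here's a minimal sketch of how one module's remote could be declared in a .mrconfig at the root of the tree. The modules/lnav path and the temporary directory are assumptions for the example, and this has not been tried against our actual monorepo:

```shell
#!/bin/sh
# Sketch: declare one module remote in a .mrconfig file.
# $work stands in for the root of the monorepo checkout (hypothetical).
set -eu
work=$(mktemp -d)

# same shape as the entries that `mr register` writes: the section name
# is the path relative to the .mrconfig, and checkout is the command
# used to create that directory
cat > "$work/.mrconfig" <<'EOF'
[modules/lnav]
checkout = git clone 'https://gitlab.com/anarcat/puppet-lnav.git' 'lnav'
EOF

cat "$work/.mrconfig"
# with myrepos installed, running `mr checkout` from "$work" should then
# clone any declared repository whose directory does not exist yet
```

Note that this only defines where to clone from: as said above, nothing here pins or verifies what ends up in the subdirectory.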

TL;DR:

Approach   | Pros                       | Cons                                     | Summary
Monorepo   | Simple                     | Double-commit                            | Status quo
Submodules | Well-known                 | Hard to use, double-commit               | Not great
Librarian  | Dep resolution client-side | Unmaintained, bad integration with git   | Not sufficient on its own
r10k       | Standard                   | Hard to deploy, opinionated              | To evaluate further
Subtree    | "best of both worlds"      | Still get double-commit, rebase problems | Not sure it's worth it
Subrepo    | ?                          | ?                                        | ?
myrepos    | Flexible                   | Esoteric                                 | Might be useful with our monorepo
Last edited 6 months ago by anarcat (previous) (diff)

comment:2 Changed 6 months ago by anarcat

Description: modified (diff)

comment:3 Changed 6 months ago by anarcat

Description: modified (diff)

comment:4 Changed 6 months ago by anarcat

Description: modified (diff)

comment:5 Changed 6 months ago by anarcat

Description: modified (diff)

comment:6 Changed 4 weeks ago by anarcat

i just tried subrepo for the lnav module and i gotta say it's pretty damn simple. all i did was:

git subrepo init modules/lnav/ -r https://gitlab.com/anarcat/puppet-lnav.git
git subrepo push modules/lnav  -r git@gitlab.com:anarcat/puppet-lnav.git

and voilà: the remote repo has the files complete with the full history of the changes in that subdirectory! that's a killer feature, i think.

the two different -r arguments are there so that people that clone the monorepo get the publicly-readable URL they can pull from in the future.

the only disadvantage of subrepo i can think of is that the subdirectory isn't a real git repository locally: it's just a subdir.

comment:7 Changed 3 weeks ago by anarcat

Description: modified (diff)

apenwarr wasn't happy with subtree so they wrote subtrac (announcement) which reimplements it with submodules, in golang. might be worth taking a look again?
