Opened 5 months ago

Closed 3 months ago

#32914 closed task (fixed)

review the puppet bootstrapping process

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Minor
Keywords: tpa-roadmap-february
Cc:
Actual Points:
Parent ID: #31239
Points: 1
Reviewer:
Sponsor:

Description

our puppet bootstrap works, but it involves copy-pasting long lines of code. see if we can improve this somehow. maybe we could hardcode the puppetmaster cert to avoid one side of the process.

at least we could wrap it in a script on the puppetmaster to simplify *that* side.

also consider what's the current state of the art in that area.

Change History (11)

comment:1 Changed 5 months ago by anarcat

Status: assigned → accepted

i looked here and there. i found a "bootstrap" bolt script here:

https://forge.puppet.com/puppetlabs/bootstrap

but that requires bolt, and from what i understand, it's just one of those awful curl | bash shell scripts that sucks in something from the central puppetmaster:

https://github.com/puppetlabs/puppetlabs-bootstrap/blob/master/tasks/linux.sh

so not really useful.

there's something called "autosigning" in Puppet, which tells the puppet master to just sign the new nodes automatically:

https://puppet.com/docs/puppet/latest/ssl_autosign.html

some people do naive autosigning in development, but otherwise manually verify new nodes before signing them. it's basically what we do: the copy-paste script we have does that, somehow.

a possible improvement on that is "policy autosigning" where the puppetmaster delegates to an external program the task of verifying the certificate. the external program gets the CSR and succeeds or fails the verification. presumably the CSR could include some magic secret that the master could verify, but i don't see how this could be used by us.
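
for reference, a policy autosigning hook could look roughly like the sketch below. Puppet runs the executable named in the `autosign` setting with the certname as its only argument and the PEM-encoded CSR on stdin, and signs only if the executable exits 0. the shared secret, the `EXPECTED_SECRET` variable, the function names and the choice of a `challengePassword` CSR attribute are all assumptions made for illustration, not something we deploy. the parsing also assumes `openssl req -text` prints the attribute as `challengePassword :value`.

```shell
#!/bin/sh
# Hypothetical policy-autosign hook (sketch only, not deployed anywhere).

# Read `openssl req -noout -text` output on stdin and print the value of the
# challengePassword attribute, if any. The "name :value" layout is assumed.
extract_challenge() {
    sed -n 's/^[[:space:]]*challengePassword[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Succeed only when a non-empty presented secret equals the expected one.
secret_matches() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Entry point as Puppet would call it: certname in $1, PEM CSR on stdin.
# EXPECTED_SECRET is a hypothetical variable holding the shared secret.
autosign_main() {
    presented=$(openssl req -noout -text | extract_challenge)
    secret_matches "$presented" "${EXPECTED_SECRET:-}"
}

# A real hook would end with:  autosign_main "$@"
```

the catch is the same as with the manual process: the secret still has to reach the new node over some trusted channel before the CSR is generated.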

i think the best way to improve the script would be to:

  1. include the Puppetmaster CA in the install process
  2. install the Puppet package in the install process
  3. add a tpa-puppet-node-add script that takes a sha256 as an argument (or prompt) and signs it after verification on the master
  4. configure puppet to configure itself to run as a cron job instead of a daemon (instead of doing this by hand during the install)

This has a few implications:

  • the puppetmaster is a special snowflake that needs manual reconfiguration of the install process when rebuilt from scratch (already the case)
  • no manual step is required on the new nodes to configure Puppet, as the CA is setup automatically during install
  • Puppet first runs as a daemon, but then needs to configure itself to run as a cron job (or timer) - this is done that way so that we don't have to run puppet by hand during the install
  • the install process *must* communicate the checksum of the agent cert reliably and securely as part of the install process
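
as a rough sketch of what the master-side script from item 3 could do: compare the checksum supplied by the operator with the fingerprint of the pending request, and sign only on a match. the function names are hypothetical, and the code assumes the `puppet cert` subcommands of Puppet 5 and earlier (newer releases moved this to `puppetserver ca`) and a `puppet cert list` line of the form `"host" (SHA256) AA:BB:...`.

```shell
#!/bin/sh
# Hypothetical sketch of a master-side "node add" helper.

# Normalize a fingerprint: drop colons and whitespace, uppercase the hex.
normalize_fp() {
    printf '%s' "$1" | tr -d ': \t\n' | tr 'abcdef' 'ABCDEF'
}

# Succeed when two fingerprints are non-empty and equal after normalization.
fp_equal() {
    a=$(normalize_fp "$1")
    b=$(normalize_fp "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Extract the SHA256 fingerprint from `puppet cert list` output on stdin,
# assumed to look like:  "host.example.org" (SHA256) AA:BB:...
pending_fp() {
    sed -n 's/.*(SHA256)[[:space:]]*\([0-9A-Fa-f:]*\).*/\1/p' | head -n 1
}

# Sign the pending request for certname $1 only if its fingerprint matches $2.
node_add() {
    pending=$(puppet cert list "$1" | pending_fp)
    if fp_equal "$pending" "$2"; then
        puppet cert sign "$1"
    else
        echo "fingerprint mismatch for $1, not signing" >&2
        return 1
    fi
}
```

normalizing both sides means it does not matter whether the operator pastes the checksum with or without colons, or in lower case.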

comment:2 Changed 4 months ago by gaba

Keywords: tpa-roadmap-february added

comment:3 Changed 4 months ago by anarcat

Points: 1

comment:4 Changed 4 months ago by anarcat

Status: accepted → needs_review

I have deployed changes in the wiki, tsa-misc and Puppet to improve on this.

The gist of the changes is as follows:

  1. there are now two scripts, one for the client, one for the master, which need to be called at about the same time and do approximately what we were doing in the wiki before, except...
  2. instead of copy-pasting commands from the master to the client, we only need to copy-paste the checksum from the client to the master; the remaining commands are hardcoded in the client script
  3. we assume the client cert doesn't need to be copy-pasted from the server back to the client
  4. we inject the Puppet CA *before* we run puppet, which reduces our exposure to MITM attacks
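
The order of operations on the client can be sketched as below. This is not the deployed script: the `run` and `bootstrap_client` names, the `DRY_RUN` switch, the CA path under /var/lib/puppet/ssl, the `puppet` package name and the server name are all assumptions. How the checksum gets displayed to the operator is left out.

```shell
#!/bin/sh
# Hypothetical sketch of the client-side bootstrap order of operations.
# Paths, the package name and the server name are assumptions.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: puppetmaster hostname, $2: local copy of the Puppet CA certificate
bootstrap_client() {
    server=$1
    ca=$2
    # 1. trust the CA before the agent ever talks to the network
    run install -D -m 0644 "$ca" /var/lib/puppet/ssl/certs/ca.pem
    # 2. keep the daemon from starting when the package is installed
    run systemctl mask puppet.service
    run apt-get install -y puppet
    # 3. submit the CSR, then re-check every 120 seconds until it is signed
    run puppet agent --test --server "$server" --waitforcert 120
}
```

Calling `bootstrap_client puppet.example.org ./ca.pem` with `DRY_RUN=1` set only prints the four commands, which makes the ordering easy to review before running it for real.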

Now, regarding our concerns:

> the puppetmaster is a special snowflake that needs manual reconfiguration of the install process when rebuilt from scratch (already the case)

That's still the case. The Puppetmaster CA is valid until 2039 so we're good for 20 years with this setup. We explicitly warn about expiry in the install script as well, although that warning might eventually get lost in the future of course...

> no manual step is required on the new nodes to configure Puppet, as the CA is setup automatically during install

that is *mostly* the case, with the caveat that we do "--waitforcert" on the client, which might hang the installer for two minutes if the operator doesn't approve the certificate fast enough.

> Puppet first runs as a daemon, but then needs to configure itself to run as a cron job (or timer) - this is done that way so that we don't have to run puppet by hand during the install

i believe i have fixed that by masking the puppet service before installing the package, but this requires testing.

> the install process *must* communicate the checksum of the agent cert reliably and securely as part of the install process

this is still the case, and assumes the operator is interacting with both the puppet client and server during the install. the idea here is that this *could* eventually be automated by an operator using Ansible or Fabric or some external orchestration that can talk to both the client and puppetmaster at the same time.
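
a minimal sketch of such an orchestration, done with plain ssh from the operator's machine: read the fingerprint on the client and hand it to the master. the `relay_fingerprint` name, the overridable `SSH` variable and root logins are assumptions, `tpa-puppet-node-add` is the script name proposed in comment:1 rather than necessarily the deployed one, and `puppet agent --fingerprint` is assumed to print a line ending in the fingerprint.

```shell
#!/bin/sh
# Hypothetical operator-side relay between the new node and the puppetmaster.

# $1: new node, $2: puppetmaster. SSH can be overridden for testing.
relay_fingerprint() {
    node=$1
    master=$2
    # assumed output: "(SHA256) AA:BB:..." ; keep the last field of the last line
    fp=$(${SSH:-ssh} "root@$node" puppet agent --fingerprint | tail -n 1 | awk '{print $NF}')
    if [ -z "$fp" ]; then
        echo "no fingerprint from $node" >&2
        return 1
    fi
    # hand the fingerprint to the (hypothetical) master-side helper
    ${SSH:-ssh} "root@$master" tpa-puppet-node-add "$node" "$fp"
}
```

the checksum is only as trustworthy as the two ssh sessions it travels over, so this relies on the operator already having verified host keys for both machines.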

i haven't figured out how to use autosigning meaningfully. it seems the puppetmaster-side script is simple enough to be easier to maintain than an autosigning hook. and it's deployed through puppet so that should also be maintainable in the future.

next step is to test this on the next new server we create.

comment:5 Changed 3 months ago by anarcat

Status: needs_review → needs_revision

comment:6 Changed 3 months ago by anarcat

the current process works! hiro found and fixed a bug, but it otherwise should streamline things a bit better.

unfortunately, the recent ssh firewall changes break ud-replicate during this process until puppet has run on the LDAP server and opened its firewall port.

i am wondering if we should simply skip the "puppet agent -t; ud-replicate" stage on the instance... this will eventually converge anyways, no?

comment:7 Changed 3 months ago by hiro

The other part I am a bit unsure about was cloning the tsa repository. I copied over the script instead. It would be nice if the script could be part of the install image.

comment:8 Changed 3 months ago by anarcat

> The other part I am a bit unsure about was cloning the tsa repository. I copied over the script instead. It would be nice if the script could be part of the install image.

We need to do that for other things in the install procedure. I'd argue that problem is not specific to puppet, but a problem with our install procedure in general (so part of #31239).

That said, I'm heading towards implementing this installer as a client-side SSH wrapper of some sort, which talks to everything magically. In that sense, the puppet bootstrap script would indeed be copied onto the server and run from there.

But I think this can be considered separate from this specific procedure.

In my mind, the only thing left to check now is to see if we really need this step of the new-machine installer:

  9. do more puppet runs, and run a ud-replicate to get ldap users, then more puppet runs since we now have more users:
      puppet agent -t
      ud-replicate
      puppet agent -t
      puppet agent -t

Could we possibly let this converge on its own? Maybe we could try just skipping that step on the next install?
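
If the step is kept but scripted, the repeated runs could be replaced by a loop that stops as soon as a run reports no changes. A sketch, with a hypothetical `converge` helper; it relies on `puppet agent --test` implying `--detailed-exitcodes`, where 0 means nothing changed and 2 means changes were applied:

```shell
#!/bin/sh
# Sketch: re-run a command until it reports "no changes" or we give up.

# $1: maximum attempts; remaining arguments: the command to run.
# Exit codes are read as for `puppet agent --test`:
# 0 = nothing changed, 2 = changes applied, anything else = failure.
converge() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        rc=0
        "$@" || rc=$?
        case $rc in
            0) return 0 ;;        # converged
            2) ;;                 # something changed, try again
            *) return "$rc" ;;    # real failure, stop here
        esac
        i=$((i + 1))
    done
    return 2                      # still changing after $max runs
}
```

Usage would be something like `converge 3 puppet agent -t`, with `ud-replicate` run once before it.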

comment:9 Changed 3 months ago by anarcat

another thing we should check is whether we can hook step 5 into the puppet bootstrap (because that's probably why it's there, otherwise it's something puppet could do itself):

  5. sanitize DNS configuration:
      grep torproject.org /etc/resolv.conf || ( echo 'domain torproject.org'; echo 'nameserver 8.8.8.8' ) > /etc/resolv.conf
      vi /etc/hosts # make sure the local host is there with both FQDN and just hostname
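
the manual /etc/hosts inspection could be turned into a check that a script can run. a sketch, with a hypothetical `hosts_has_entry` helper that succeeds when some line maps an address to both the FQDN and the short name:

```shell
#!/bin/sh
# Sketch: check that a hosts file has one line naming both the FQDN and
# the short hostname.

# $1: hosts file, $2: FQDN, $3: short hostname
hosts_has_entry() {
    awk -v fqdn="$2" -v short="$3" '
        {
            sub(/#.*/, "")                  # drop comments
            f = 0; s = 0
            for (i = 2; i <= NF; i++) {     # field 1 is the address
                if ($i == fqdn)  f = 1
                if ($i == short) s = 1
            }
            if (f && s) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

for example, `hosts_has_entry /etc/hosts "$(hostname -f)" "$(hostname -s)" || echo "fix /etc/hosts"` would flag a host that still needs the manual edit.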

comment:10 Changed 3 months ago by anarcat

step 5 eliminated and moved to the prerequisites (for /etc/hosts) or puppet bootstrap (for /etc/resolv.conf). steps 7 (nevii) and 9 (do more puppet runs) should probably be removed on next run.

comment:11 Changed 3 months ago by anarcat

Resolution: fixed
Status: needs_revision → closed

tying up loose ends here:

> that is *mostly* the case, with the caveat that we do "--waitforcert" on the client, which might hang the installer for two minutes if the operator doesn't approve the certificate fast enough.

this works in the bootstrap at least. we might not want to do that in the automated systems, but at least the --waitforcert is compatible with --test, which i was worried about.

> i believe i have fixed that by masking the puppet service before installing the package, but this requires testing.

i confirm this works.

> i am wondering if we should simply skip the "puppet agent -t; ud-replicate" stage on the instance... this will eventually converge anyways, no?

i added this as part of the client bootstrap script.

> another thing we should check is whether we can hook step 5 into the puppet bootstrap (because that's probably why it's there, otherwise it's something puppet could do itself):

I moved this to the hetzner-robot installer and made it a requirement.

> steps 7 (nevii) and 9 (do more puppet runs) should probably be removed on next run.

done: i confirm that nevii figures it out eventually and step 9 was folded into the bootstrap.

i think we're done here. eventually the puppet bootstrap can be merged back into the one big installer, but for now it can't as long as we stick with the "shell script on server" design.
