Opened 15 months ago

Last modified 6 months ago

#31239 assigned enhancement

automate installs

Reported by: anarcat
Owned by: anarcat
Priority: Low
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords: tpa-roadmap-november
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

right now, installing machines is mostly a manual, or semi-manual process: we install debian, preferably with crypto, and then do stuff on top.

some of it is done by hand, some is done in puppet.

we should have a standardized install process that gives us a reproducible, identical install across platforms. then Puppet is what customizes the machine on top of that.

this ticket aims at documenting what we already have and where we could possibly go. this is one of the questions we answered "no" to in the "ops questionnaire" in #30881. see also the automated upgrade part in #31957.

When we started this work, the installer had this many manual steps:

  • new-machine (common trunk): 14 steps
  • new-machine-hetzner-robot: +43 steps (57 total)
  • new-machine-hetzner-cloud: +21 steps (35 total)

Child Tickets

Ticket  Type     Status    Owner    Summary
#32283  defect   closed    anarcat  fix up /etc/aliases with puppet
#32901  project  assigned  anarcat  puppetize Nagios
#32902  task     closed    hiro     document the current install workflow visually
#32914  task     closed    anarcat  review the puppet bootstrapping process
#33143  task     closed    anarcat  ferm: convert BASE_SSH_ALLOWED rules into puppet exported rules
#33332  task     assigned  hiro     move root passwords to trocla?
#33387  task     closed    anarcat  establish tmpfs policy

Change History (26)

comment:1 Changed 15 months ago by anarcat

right now the "installers" are shell scripts and snippets in tsa-misc. there's a monolithic tor-install-hetzner script that has been used to install virtual machines, and other scripts that are "chunks" of things that can be done on new servers (partitioning, LDAP entry, luks setup).

the process is documented in new-machine.

comment:2 Changed 15 months ago by anarcat

possible tools to research further:

  • cobbler - takes care of PXE and boot, delegates the autoinstall to kickstart, more relevant to RPM-based distros
  • terraform - config management for the cloud kind of thing, supports Hetzner Cloud, but not Hetzner Robot or Ganeti
  • FAI - built by a debian developer, used to build live images since buster, might require complex setup (e.g. an NFS server), setup-storage(8) might be reusable on its own
  • list of debian setup tools, see also AutomatedInstallation
  • himblock has some interesting post-install configure bits in Python, along with pyparted bridges
  • livewrapper is also one of those installers, in a way

Unfortunately, I ruled out the official debian-installer because of the complexity of the preseeding system and partman.

Update: that list is now maintained in https://help.torproject.org/tsa/howto/new-machine/#Alternatives_considered

Last edited 9 months ago by anarcat

comment:3 Changed 12 months ago by anarcat

Owner: changed from tpa to anarcat
Status: changed from new to assigned

i want to tackle this. i think we're pretty close with the ganeti stuff and the half-assed installer I wrote, but i would maybe like to write a spec on how to phase out and replace, or improve, the latter. maybe our installer could be formally released as a standalone thing, if only to get feedback from the community and provoke some discussion and maybe something better. right now, Debian is still working on the debian-installer distribution (for servers) and calamares (for desktop), neither of which is a good fit for our environment.

as far as VMs are concerned, the non-ganeti installers should be progressively phased out as we migrate everything into ganeti cluster(s), so that is probably a non-issue. there was a bug with the ganeti installer (#31781) but that should (eventually) be fixed upstream or in puppet.

comment:4 Changed 11 months ago by anarcat

Description: modified

link to the auto upgrade and questionnaire bits.

comment:5 Changed 11 months ago by anarcat

i had a nice chat with Thomas Lange who confirmed a few things about FAI:

  • it requires a server (fai-server to be more precise)
  • it needs control over the boot environment (custom ISO or PXE + NFS)
  • it does *not* use the debian-installer, instead the base system is installed through tar files which have the same content as a debootstrap call
  • preseeding works by running dpkg-reconfigure on the packages part of the tar file
  • custom FAI-enabled boot images are available from https://fai-project.org/FAIme/ but you can also create your own
  • setup-storage can be used without an installer

comment:6 Changed 10 months ago by anarcat

i created a "discussion" section in the new machine wiki page where i copied the alternatives listed earlier here and added a few. documentation on those tools should be done over there from here on.

comment:7 Changed 10 months ago by anarcat

in #32902, hiro and I played with draw.io to draw diagrams of what the current install process looks like. it was a fun exercise, and showed a few interesting things:

  • too much duplication between the two disk formatters, which should be resolved
  • duplication between the disk formatters and luks-setup
  • inconsistencies between sites: hrobot writes authorized-keys in /root/.ssh, hcloud in /etc/ssh/userkeys/, one uses grml-debootstrap, the other debootstrap

I'm leaning towards scrapping the current install process and converging towards a simpler process that would be basically:

  1. pick IP address, hostname and other static parameters
  2. create metal/cloud upstream
  3. get a console (ssh, web console, whatever)
  4. use setup-storage to partition the disk, based on well-defined templates
  5. mount everything
  6. run debootstrap
  7. setup network, including hostname (maybe reusing gnt-network stuff?)
  8. populate LDAP
  9. bootstrap Puppet in the chroot
  10. reboot
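The ordering above can be sketched as a small step runner. This is only an illustration of the proposed sequence, not an existing installer: the step names come from the list, while `todo`, `INSTALL_STEPS` and `run_install` are hypothetical names and every action is a placeholder.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def todo() -> None:
    """Hypothetical placeholder: the real action for each step is not defined here."""


# Step names are taken from the list above; the actions are placeholders.
INSTALL_STEPS: List[Step] = [
    ("pick IP address, hostname and other static parameters", todo),
    ("create metal/cloud upstream", todo),
    ("get a console", todo),
    ("partition the disk with setup-storage", todo),
    ("mount everything", todo),
    ("run debootstrap", todo),
    ("setup network, including hostname", todo),
    ("populate LDAP", todo),
    ("bootstrap Puppet in the chroot", todo),
    ("reboot", todo),
]


def run_install(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run the steps strictly in order; in dry-run mode only print them."""
    completed = []
    for number, (name, action) in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {name}")
        if not dry_run:
            action()
        completed.append(name)
    return completed


if __name__ == "__main__":
    run_install(INSTALL_STEPS)
```

Keeping the order in one list makes the "bootstrap Puppet before the first boot" constraint explicit: Puppet is step 9, reboot is step 10.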

Every remaining manual step can then be done in Puppet, as it runs before the first boot. Those steps, currently done manually, are already done by Puppet so automating this is just a matter of ordering:

  • SSH daemon and keys configuration
  • automated upgrades (part of the larger #31957)
  • /etc/hosts management?

Those would need some coding work in Puppet:

  • root password management (trocla? abandon?)
  • swapfile (move to setup-storage?)
  • kernel and grub setup?
  • mdadm.conf, fstab and crypttab config (setup-storage?)
  • dropbear-initramfs setup
  • mandos setup
  • net.ifnames=0

Those steps would stay manual until they are configured in Puppet.

So the next step seems to be to experiment with changing the order of the install process to bootstrap Puppet earlier and see what happens. We should also experiment with a different partitioning tool, probably setup-storage.

TL;DR: next steps:

  1. test setup-storage
  2. bootstrap Puppet earlier

comment:8 Changed 10 months ago by hiro

I agree that the current install process has too many manual bits and needs to be improved. I'd like to get to a point where we have as much as possible in puppet and as few scripts as possible to bootstrap the system. The idea to use ansible up to the point where puppet kicks in is great in this sense imo.

comment:9 Changed 10 months ago by anarcat

> The idea to use ansible up to the point where puppet kicks in is great in this sense imo.

I'd be open to this idea. But before I would start messing around with Ansible, I'd do things by hand and refactor things around Puppet. I'm not familiar enough with Ansible to be confident I would go anywhere. :p

One problem I feel is inherent to Ansible is that it has its own bootstrap problems. We first need to set up SSH to get it working, and that means fiddling around with networking and SSH configuration by hand. But maybe that would be easier than bootstrapping the entire host (partitioning, networking and debootstrap) by hand?

Or is there an easier way to bootstrap ansible? Could we git clone an ansible playbook on new hosts and run it directly from there?

comment:10 Changed 10 months ago by anarcat

one thing to consider is that if we're ready to go the pure-systemd way, we can totally get rid of /etc/fstab and rely on the magics of systemd for boot.

https://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/
https://wiki.archlinux.org/index.php/Systemd#GPT_partition_automounting
https://wiki.archlinux.org/index.php/Swap#Activation_by_systemd

this way we just have to partition and format the disks 'just so', mount and debootstrap and everything follows.

comment:11 Changed 9 months ago by anarcat

#32937 has seen a fairly successful install using setup-storage that would remove the need for custom shell scripts in favor of reusable, fairly readable config files.

i've also reshuffled the new-machine-hetzner-robot docs in that direction, but the scripts still need to be removed and the docs updated accordingly.

comment:12 Changed 9 months ago by anarcat

the installer/tor-install-format-disks-nvme+hdds script was rewritten to use setup-storage. the docs don't really need an update since they just tell the operator to look around for the script.

once we have converted the other partitioner, however, we might want to change the rest of the install procedure to assume we have used setup-storage and source /tmp/fai/disk_var.sh to get the BOOT_DEVICE, which we currently prompt for.
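A minimal sketch of that idea, assuming /tmp/fai/disk_var.sh is a plain shell fragment that assigns BOOT_DEVICE (the exact contents of that file are not shown in this ticket). The helper names `read_shell_var` and `boot_device` are hypothetical; the prompt is kept as a fallback for hosts that were not partitioned with setup-storage.

```python
import re
import subprocess
from typing import Optional


def read_shell_var(path: str, name: str) -> Optional[str]:
    """Source a shell fragment with sh and return one variable, or None if unset or empty."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"not a valid shell variable name: {name!r}")
    # "$1" is the file to source; the variable name was validated above.
    script = f'. "$1" && printf %s "${name}"'
    result = subprocess.run(
        ["sh", "-c", script, "sh", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout or None


def boot_device(disk_vars: str = "/tmp/fai/disk_var.sh") -> str:
    """Prefer the value written during partitioning, fall back to asking the operator."""
    try:
        device = read_shell_var(disk_vars, "BOOT_DEVICE")
    except (subprocess.CalledProcessError, OSError):
        device = None  # file missing or not sourceable
    return device or input("boot device (e.g. /dev/sda): ")
```

Letting sh do the sourcing avoids re-implementing shell parsing in Python, at the cost of executing whatever the file contains, so it should only be pointed at files the installer itself produced.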

comment:13 Changed 9 months ago by anarcat

i did just that and ditched the formatting script, which is now just a legacy wrapper.

comment:14 Changed 9 months ago by gaba

Keywords: tpa-roadmap-february added

comment:15 Changed 9 months ago by anarcat

one part that was missing in our documentation is the firewall setup. we had network allow blocks covering all hosts configured by hand in tor-puppet/modules/ferm/templates/defs.conf.erb. instead of updating the install docs, I just fixed this and shoved it in puppet, in #33143.

comment:16 Changed 9 months ago by anarcat

removed two more steps: the /etc/aliases junk (#32283) and the portmap/etc package removal (also done in puppet).

comment:17 Changed 9 months ago by anarcat

Priority: changed from Medium to Low

comment:18 Changed 9 months ago by anarcat

Description: modified

Document how many steps we had when we drew the diagrams:

When we started this work, the installer had this many manual steps:

  • new-machine (common trunk): 14 steps
  • new-machine-hetzner-robot: +43 steps (57 total)
  • new-machine-hetzner-cloud: +21 steps (35 total)

Now we're at:

  • new-machine (common trunk): 13 steps (3 steps possibly obsolete, 4 more being worked on)
  • new-machine-hetzner-robot: +25 steps left (38 total)
  • new-machine-hetzner-cloud: +21 steps (35 total, unchanged, needs to merge with setup-storage process)

i.e. we have eliminated a whopping 19 steps from the hetzner-robot procedure (57 down to 38), most of them through the setup-storage refactoring.
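For the record, the 19 comes from the hetzner-robot path: the total is the common trunk plus the robot-specific steps, before and after. A short check of that arithmetic, using only the counts quoted in this comment (the dictionary and function names are just for illustration):

```python
# Step counts quoted above: common trunk plus hetzner-robot specific steps.
BEFORE = {"new-machine": 14, "new-machine-hetzner-robot": 43}
AFTER = {"new-machine": 13, "new-machine-hetzner-robot": 25}


def robot_total(counts: dict) -> int:
    """Total manual steps for a hetzner-robot install: trunk + robot-specific."""
    return counts["new-machine"] + counts["new-machine-hetzner-robot"]


before_total = robot_total(BEFORE)  # 14 + 43 = 57
after_total = robot_total(AFTER)    # 13 + 25 = 38
print(before_total, after_total, before_total - after_total)  # prints "57 38 19"
```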

comment:19 Changed 8 months ago by anarcat

while setting up the fsn-node-04 server, i got the checklist from 17 to 12 steps, with 5 of those being only safety checks! we're on our way to having this be a single "deploy git repo and run this one command" installer :)

comment:20 Changed 8 months ago by anarcat

removed another 4 steps from the common trunk, we're now at 9 steps there, which are fairly streamlined and can't be trimmed further without changing the design (ie. we need orchestration).

we're now at this state:

  • new-machine (common trunk): 9 steps
  • new-machine-hetzner-robot: +12 steps (21 total), many of which can be merged into hooks next time
  • new-machine-hetzner-cloud: unchanged

comment:21 Changed 8 months ago by anarcat

one thing that might be interesting is to look at stuff the grml people are doing in production. this here is a grml-debootstrap wrapper that does a bunch of interesting things:

https://github.com/sipwise/deployment-iso/blob/1b1e54b822b8af6b6c691993eae9d6589ed8b483/templates/scripts/includes/deployment.sh#L2175

namely:

  • EFI support
  • grub configuration (e.g. net.ifnames=0)
  • multiple disks support (reported upstream as bug 152)
  • mmdebstrap instead of debootstrap (simply export DEBOOTSTRAP=mmdebstrap!)
  • third-party repo configuration
  • etckeeper configuration
  • /etc/hosts configuration
  • a partitioning shell script
  • a reset /etc/debootstrap/packages (just like us)
  • automated grml-debootstrap run (echo y | grml-debootstrap??)
  • an elaborate puppet bootstrap

comment:22 Changed 8 months ago by anarcat

Keywords: tpa-roadmap-april added; tpa-roadmap-february removed

comment:23 Changed 7 months ago by anarcat

today, i did a new-machine-hetzner-robot process almost entirely automatically, using fabric, with the following command:

./install -H root@88.99.194.57 --fingerprint 0d:4a:c0:85:c4:e1:fe:03:15:e0:99:fe:7d:cc:34:f7 --verbose hetzner-robot fsn-node-05.torproject.org installer/disk-config/gnt-fsn-NVMe installer/packages installer/post-scripts/

the fingerprint was the ed25519 one provided by hetzner by email.

this is a major step in the automation work because we reviewed the way Fabric handles remote hosts SSH keys (it doesn't, ouch), and worked around the problems found. we especially were able to add the --fingerprint argument *fairly* easily once I understood the internal mechanics of Paramiko (which wasn't quite obvious).
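To illustrate the kind of check a --fingerprint argument implies, here is a stdlib-only sketch. It is not the code in the install script. It assumes the 16 colon-separated bytes passed on the command line are the MD5 digest of the raw public key blob (in Paramiko, `PKey.get_fingerprint()` returns those MD5 bytes and `PKey.asbytes()` the blob); `md5_fingerprint` and `fingerprint_matches` are hypothetical names.

```python
import hashlib


def md5_fingerprint(key_blob: bytes) -> str:
    """Format the MD5 digest of a public key blob as colon-separated hex pairs."""
    digest = hashlib.md5(key_blob).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fingerprint_matches(key_blob: bytes, expected: str) -> bool:
    """Compare the key presented by the host with the fingerprint received out of band."""
    return md5_fingerprint(key_blob) == expected.strip().lower()
```

The connection should be refused when `fingerprint_matches` returns False, since at that point the only trusted information about the host is the fingerprint received out of band.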

the next step of this process is to finish converting the common trunk, new-machine, into fabric, so that (e.g.) puppet procedures are fully automated.

but i believe this can wait until the next server. doing this install took about a day because of the automation, so we shouldn't burn too much work credit on that...

comment:24 Changed 6 months ago by anarcat

we might have a problem with automated installs using debootstrap, as it sets up usrmerge by default, which seems to cause significant problems:

https://wiki.debian.org/Teams/Dpkg/MergedUsr

we might want to switch to mmdebstrap for performance if not reliability anyways.

this will at least need research and testing to confirm this is a problem.

comment:25 Changed 6 months ago by anarcat

i filed #34115 to follow up on usrmerge.

comment:26 Changed 6 months ago by anarcat

Keywords: tpa-roadmap-november added; tpa-roadmap-april removed

running out of time to do more automation, so pushing back 6 months.
