Opened 8 months ago

Last modified 5 months ago

#33406 accepted project

automate reboots

Reported by: anarcat
Owned by: anarcat
Priority: Low
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Major
Keywords: tpa-roadmap-june
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by anarcat)

in #31957 we have worked on automating upgrades, but that's only part of the problem. we also need to reboot in some situations.

we have various mechanisms to do so right now:

  • tsa-misc/reboot-host - reboot script for kvm boxes, kind of a mess, to be removed when we finish the kvm-ganeti migration
  • tsa-misc/reboot-guest - reboot a single host. kind of a hack, but useful to reboot a single machine
  • misc/multi-tool/torproject-reboot-simple - iterate over all hosts with rebootPolicy=justdoit in LDAP and reboot them with torproject-reboot-many
  • misc/multi-tool/torproject-reboot-rotation - iterate over all hosts with rebootPolicy=rotation in LDAP and reboot them with torproject-reboot-many, with a 30 minute delay between each host
  • ganeti-reboot-cluster - a tool to reboot the ganeti cluster

There are various problems with all this:

  • the torproject-reboot-* scripts do not take care of rebootPolicy=manual hosts
  • the ganeti-reboot-cluster script has been known to fail if a cluster is unbalanced
  • the ganeti-reboot-cluster script currently fails when hosts talk to each other over IPv6 somehow (see #33412)
  • we have 5 different ways of performing reboots, we should have just one script that does it all
  • reboot-{host,guest} do not check if hosts need reboot before rebooting (but the multi-tool does)

In short, this is kind of a mess, and we should refactor this. We should consider using needrestart, which knows how to tell whether an individual host needs a reboot.

I also added a feature request to the needrestart puppet module to expose its knowledge as a puppet fact, so we can use that information from PuppetDB instead of SSH'ing in each host and calling the dsa-* tools.

Change History (12)

comment:1 Changed 8 months ago by anarcat

note that this may very well mean just removing tsa-misc/reboot-host and tsa-misc/reboot-guest, and documenting the multi-tool better. :)

i just tried ./torproject-reboot-rotation and ./torproject-reboot-simple and the unattended operation isn't great... it fires up all those reboots, and doesn't show clearly what it did. for example, it seems to have queued reboots on a bunch of hosts, but it doesn't say which.

after further inspection (with cumin '*' 'screen -ls | grep reboot-job'), i have found it has scheduled reboots on

  • static-master-fsn.torproject.org
  • cdn-backend-sunet-01.torproject.org
  • web-fsn-01.torproject.org
  • onionoo-frontend-01.torproject.org
  • orestis.torproject.org
  • nutans.torproject.org
  • chives.torproject.org
  • onionbalance-01.torproject.org
  • listera.torproject.org
  • peninsulare.torproject.org

Most of those are okay and should return unattended. But in some cases, those could have been covered by a libvirt reboot (i had performed those before, in this case, so none were). Eventually though, that point is moot because we'll all be running under ganeti and will separate host and guest reboot procedures.

one host is problematic in there (chives) as it needs a specific warning to users. maybe chives should be taken out of "justdoit" rotation...

i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.

then i don't know which hosts are left to do manually, but i guess that, with time, nagios will let us know. it would be nice to have a scenario for those as well.

Last edited 7 months ago by anarcat

comment:2 in reply to:  1 ; Changed 8 months ago by arma

Replying to anarcat:

> i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.

This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.

comment:3 in reply to:  2 Changed 8 months ago by anarcat

Replying to arma:

> Replying to anarcat:
>
> > i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
>
> This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.

That's exactly what I had in mind. The trick is whether individual hosts should connect to IRC to issue those notifications (?!) or whether the calling script should. Either way, we'd need some sort of notification bot, which has been kind of a pain in the arse before in my experience. But maybe we could leverage KGB for this?

It's one of the reasons I'm thinking of rebuilding this system in the first place as well...

Thanks for the feedback!

comment:4 Changed 8 months ago by anarcat

Description: modified (diff)

filed bug #33412 about the ganeti-reboot-cluster bug

comment:5 Changed 8 months ago by anarcat

just for future reference, ganeti-reboot-cluster, as we have in our puppet repo, doesn't work in our cluster, because it relies on assumptions specific to the DSA clusters (namely that the last node is an empty spare). so it fails with:

fsn-node-03.torproject.org not empty.

apparently, the latest version of the script might fix that with the crossmigratemany function:

https://salsa.debian.org/dsa-team/mirror/dsa-puppet/raw/master/modules/ganeti2/files/ganeti-reboot-cluster

for now, i'll just do the reboot by hand.

in theory, rebooting a ganeti node is to:

  1. migrate all the primaries off of the node: ssh $master gnt-node migrate -f $node
  2. if it's a master, promote another master: ssh $notmaster gnt-cluster master-failover (optional, only if we can't afford having the master down during the reboot)
  3. reboot the node: ssh -tt $node reboot

... for each node.
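
The steps above can be sketched as a small planning function. This is only an illustration, not the code that ended up in tsa-misc: the function name, its arguments and the (ssh target, command) tuple format are made up for the example, and the migration is spelled `gnt-node migrate`, the ganeti subcommand that moves all primaries off a node. It only builds the commands, it does not run anything.

```python
from typing import List, Optional, Tuple


def ganeti_reboot_plan(
    node: str,
    master: str,
    new_master: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Return the (ssh target, command) pairs needed to reboot one ganeti node.

    `new_master` is hypothetical: it is only used when `node` is the
    current master and we cannot afford having the master down.
    """
    # step 1: migrate all primaries off of the node, from the master
    plan = [(master, f"gnt-node migrate -f {node}")]
    # step 2 (optional): promote another master before rebooting this one
    if node == master and new_master is not None:
        plan.append((new_master, "gnt-cluster master-failover"))
    # step 3: reboot the node itself
    plan.append((node, "reboot"))
    return plan


if __name__ == "__main__":
    for target, command in ganeti_reboot_plan(
        "fsn-node-03.torproject.org", "fsn-node-01.torproject.org"
    ):
        print(f"ssh -tt {target} {command}")
```

Keeping the plan separate from its execution makes it easy to print a dry run before touching the cluster.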

i'm testing that procedure on fsn-node-03 now.

Last edited 8 months ago by anarcat

comment:6 Changed 8 months ago by anarcat

i wrote a simple reboot prototype that does just that, but can also be used as a reboot-guest replacement:

https://gitweb.torproject.org/admin/tsa-misc.git/tree/ganeti-reboot-cluster-fabric-prototype

it's mostly a test to see how Fabric works and is not intended to be a replacement for all tools just yet.

but i find the results promising: it's much nicer to work in python with that stuff: errors are (mostly) well defined and it's easy to modularize things. for example, i originally wrote the thing to migrate fsn-node-01 (and that worked) but then i could extend it to also reboot arbitrary nodes (and i rebooted gayi).

comment:7 Changed 7 months ago by anarcat

Description: modified (diff)

that prototype is now a library, in https://gitweb.torproject.org/admin/tsa-misc.git/tree/fabric_tpa/reboot.py

it can be called with a wrapper script in https://gitweb.torproject.org/admin/tsa-misc.git/tree/reboot

with something like:

./reboot -H fsn-node-03.torproject.org,...

it handles ganeti nodes, but not libvirt nodes. it therefore replaces the following:

  • tsa-misc/reboot-guest
  • ganeti-reboot-cluster

it *could* also replace the following, provided that (a) a host list is somehow generated out of band and (b) the operator stays online long enough for the job to complete:

  • misc/multi-tool/torproject-reboot-simple
  • misc/multi-tool/torproject-reboot-rotation - with an explicit 30 minutes delay

The remaining script (tsa-misc/reboot-host) has been marked as deprecated, and will be removed once we get rid of the last KVM/libvirt server (#33084).

So the remaining work here is to extend the reboot script to do an automatic inventory of the hosts requiring a reboot and to schedule them according to policy. We should also make sure the ganeti reboot handlers schedule a rebalance of the cluster when it's done, like it's currently done by ganeti-reboot-cluster. This should be documented in the ganeti and upgrades wiki pages when done.

We also don't check if a reboot is required at all right now, and we should do so. All those "TODO" items are documented in the tsa-misc source code listed above.

Last edited 7 months ago by anarcat

comment:8 in reply to:  2 Changed 7 months ago by anarcat

Replying to arma:

> Replying to anarcat:
>
> > i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
>
> This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.

This is getting closer to reality now. There's a KGB bot living on chives now (but just use the kgb-bot.torproject.org alias instead) that can be used for such notifications. It's not hooked into fabric just yet, but that's the next step. With the configuration from /etc/kgb-client-tpa.conf, one can do:

kgb-client --conf kgb-client-tpa.conf --relay-msg test

... and that will say "test" in #tor-project and #tor-bots. This is obviously configurable, but the next step here is to find the best way to hook this into Fabric.

I'm tempted to just shell out locally and do exactly the above to send notifications, as opposed to implementing a full KGB client in Python (!). But then again, "it's just JSON-RPC with some authentication mechanism". And we just use the "relay_message" bit:

https://manpages.debian.org/unstable/kgb-bot/kgb-protocol.7p.en.html#relay_message_message

... so "how hard could it be"?

Fun times.
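
For reference, the "just shell out" option could look something like the sketch below. It is not hooked into Fabric and is not what the reboot script does; the function names and the injectable `run` argument are made up for the example. The only things taken from the above are the kgb-client flags and the configuration path.

```python
import subprocess
from typing import Callable, List

KGB_CONF = "/etc/kgb-client-tpa.conf"  # configuration path mentioned above


def kgb_argv(message: str, conf: str = KGB_CONF) -> List[str]:
    """Build the kgb-client command line for relaying one message."""
    return ["kgb-client", "--conf", conf, "--relay-msg", message]


def notify(message: str, run: Callable = subprocess.run) -> bool:
    """Relay `message` to IRC and report whether that worked.

    A failed notification is logged and swallowed, on the assumption
    that a broken bot should never abort a reboot run.
    """
    try:
        run(kgb_argv(message), check=True, timeout=30)
    except (OSError, subprocess.SubprocessError) as exc:
        # OSError covers a missing kgb-client binary; SubprocessError
        # covers non-zero exit codes and timeouts
        print(f"notification failed: {exc}")
        return False
    return True
```

A caller would then wrap the reboot with something like notify("rebooting x") and notify("x is back").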

comment:9 Changed 7 months ago by anarcat

i did more work on the reboot procedures today, and rebooted the ganeti cluster using the reboot command. there were some issues with the initrd interfering with the wait_for_boot (now called wait_for_ping) checks so I did some refactoring, but i'm still confused about the exception that's raised by Fabric in this case.

the exception I got here is:

    All instances migrated successfully.
    Shutdown scheduled for Thu 2020-04-02 18:30:55 UTC, use 'shutdown -c' to cancel.
    waiting 0 minutes for reboot to happen
    waiting up to 30 seconds for host to go down
    waiting 300 seconds for host to go up
    host fsn-node-01.torproject.org should be back online, checking uptime
    Traceback (most recent call last):
      File "./reboot", line 132, in <module>
        logging.getLogger(mod).setLevel('WARNING')
      File "./reboot", line 116, in main
        delay_up=args.delay_up,
      File "/usr/lib/python3/dist-packages/invoke/tasks.py", line 127, in __call__
        result = self.body(*args, **kwargs)
      File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/reboot.py", line 197, in shutdown_and_wait
        res = con.run('uptime', watchers=[responder], pty=True, warn=True)
      File "<decorator-gen-3>", line 2, in run
      File "/usr/lib/python3/dist-packages/fabric/connection.py", line 29, in opens
        self.open()
      File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/__init__.py", line 106, in safe_open
        Connection.open_orig(self)
      File "/usr/lib/python3/dist-packages/fabric/connection.py", line 634, in open
        self.client.connect(**kwargs)
      File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in connect
        retry_on_signal(lambda: sock.connect(addr))
      File "/usr/lib/python3/dist-packages/paramiko/util.py", line 280, in retry_on_signal
        return function()
      File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in <lambda>
        retry_on_signal(lambda: sock.connect(addr))
    TimeoutError: [Errno 110] Connection timed out

maybe the exception gets generated *above* our code, in the fabric task handler itself, in which case it might mean we shouldn't use a @task for this at all, at least in our code.
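
One way to sidestep that kind of traceback, independently of where Fabric raises it, is to poll the SSH port with plain sockets before opening the real connection, and treat connection errors as "not up yet". This is only a sketch under that assumption, not the fix that went into fabric_tpa; the function name and its parameters are made up here.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=300.0, interval=5.0,
                 connect=socket.create_connection,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once the port answers, False if `timeout` seconds
    pass first. `connect`, `sleep` and `clock` are injectable so the
    loop can be exercised without a real host.
    """
    deadline = clock() + timeout
    while True:
        try:
            connect((host, port), timeout=interval).close()
            return True
        except OSError:
            # TimeoutError and ConnectionRefusedError both derive from
            # OSError: the host is simply not back yet
            if clock() >= deadline:
                return False
            sleep(interval)
```

Only once this returns True would the caller run uptime over the real SSH connection, so a slow boot shows up as a False return value instead of an unhandled TimeoutError.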

comment:10 Changed 6 months ago by anarcat

Keywords: tpa-roadmap-april added; tpa-roadmap-march removed
Owner: changed from tpa to anarcat
Status: changed from new to accepted

comment:11 Changed 6 months ago by anarcat

Keywords: tpa-roadmap-may added; tpa-roadmap-april removed

i fixed the timeout error, and did today's round of upgrades without too many problems. one issue that came up is that ganeti wasn't happy to chain-reboot machines: some instances had to have activate-disks run so they recognize their secondary. that has been added as a TODO in the code.

i also made some experiments with feeding LDAP hosts lists as an argument to the reboot command which also worked well. this, for example, rebooted the rotation hosts with a 10-minute delay:

./reboot -H $(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "(&(hostname=*.torproject.org)(rebootPolicy=rotation))" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort') -v
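
The awk part of that pipeline is what would need to move into Python if the reboot script learns to talk to LDAP itself. A minimal sketch of that parsing step, assuming the same `ldapsearch -LLL ... hostname` output as input (the function name is made up, and base64-encoded `hostname::` values are deliberately ignored):

```python
from typing import List


def hosts_from_ldif(ldif: str) -> List[str]:
    """Extract sorted, de-duplicated hostnames from ldapsearch -LLL output."""
    hosts = set()
    for line in ldif.splitlines():
        # LDIF attribute lines look like "hostname: fallax.torproject.org"
        key, sep, value = line.partition(": ")
        if sep and key == "hostname":
            hosts.add(value.strip())
    return sorted(hosts)
```

The result can be joined with commas to match the `-H host1,host2` form shown earlier in this ticket.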

I added a modified recipe to the upgrades page, which covers all cases.

I also set the reboot policy on the hosts that didn't have one, so they are classified properly. Those now have:

manual:

  • moly (KVM, requires special handling)
  • kvm4 (KVM)
  • kvm5 (KVM)
  • scw-arm-par-01 (buggy buildbox, see #32920)
  • fsn-node-01 (ganeti, requires special handling)
  • fsn-node-02 (ganeti)
  • fsn-node-03 (ganeti)
  • weissii (windows buildbox, no ssh)
  • woronowii (windows buildbox, no ssh)
  • winklerianum (windows buildbox, no ssh)

justdoit:

  • pauli (puppet)
  • rude (rt)
  • alberti (ldap)
  • eugeni (mail)
  • majus (translation)
  • rouyi (jenkins)
  • troodi (trac)
  • nevii (dns primary)
  • henryi (consensus-health)
  • vineale (gitweb)
  • gayi (svn)
  • polyanthum (bridges)
  • materculae (exonerator)
  • meronense (metrics.tpo)
  • colchicifolium (collector backend)
  • carinatum (DocTor)
  • build-x86-05 (buildbox)
  • build-x86-06 (buildbox)
  • build-x86-08 (buildbox)
  • build-x86-09 (buildbox)
  • perdulce (people.tpo)
  • staticiforme (static master)
  • forrestii (fpcentral)
  • subnotabile (survey)
  • crm-int-01 (CRM backend)
  • crm-ext-01 (CRM frontend)
  • submit-01 (mail)

rotation:

  • fallax (DNS secondary)
  • omeiense (onionoo backend)
  • oo-hetzner-03 (onionoo backend)
  • neriniflorum (DNS secondary)
  • web-hetzner-01 (web frontend)
  • web-cymru-01 (web frontend)

the following were already configured as...

rotation:

  • orestis (onionoo backend)
  • nutans (DNS secondary)
  • cdn-backend-sunet-01 (web frontend)
  • hetzner-hel1-02 (DNS secondary)
  • hetzner-hel1-03 (web frontend)
  • onionoo-backend-01 (onionoo backend)
  • web-fsn-01 (web frontend)
  • web-fsn-02 (web frontend)
  • onionoo-frontend-01 (onionoo frontend)
  • cache01 (cache frontend)
  • cache-02 (cache frontend)
  • onionoo-backend-02 (onionoo backend)

justdoit:

  • corsicum (collector)
  • hetzner-hel1-01 (nagios)
  • bungei (backup storage)
  • hetzner-nbg1-01 (prometheus)
  • hetzner-nbg1-02 (prometheus)
  • archive-01 (non-redundant web frontend)
  • loghost01 (syslog)
  • static-master-fsn (static master)
  • bacula-director-01 (backup director)
  • gettor-01 (gettor)
  • onionbalance-01 (onionbalance)
  • chives (IRC)
  • build-arm-10 (buildbox)
  • tbb-nightlies-master (static master)
  • gitlab-02 (gitlab)
  • check-01 (check.tpo)

manual:

  • mandos-01 (mandos, requires crypto)
  • fsn-node-04
  • fsn-node-05

In other words, I made the following diff in LDAP:

--- policy-before	2020-04-30 19:48:50.158412413 -0400
+++ policy-after	2020-04-30 19:54:15.209832522 -0400
@@ -6,27 +6,35 @@
 
 dn: host=moly,ou=hosts,dc=torproject,dc=org
 host: moly
+rebootPolicy: manual
 
 dn: host=pauli,ou=hosts,dc=torproject,dc=org
 host: pauli
+rebootPolicy: justdoit
 
 dn: host=rude,ou=hosts,dc=torproject,dc=org
 host: rude
+rebootPolicy: justdoit
 
 dn: host=alberti,ou=hosts,dc=torproject,dc=org
 host: alberti
+rebootPolicy: justdoit
 
 dn: host=cupani,ou=hosts,dc=torproject,dc=org
 host: cupani
+rebootPolicy: justdoit
 
 dn: host=fallax,ou=hosts,dc=torproject,dc=org
 host: fallax
+rebootPolicy: rotation
 
 dn: host=eugeni,ou=hosts,dc=torproject,dc=org
 host: eugeni
+rebootPolicy: justdoit
 
 dn: host=majus,ou=hosts,dc=torproject,dc=org
 host: majus
+rebootPolicy: justdoit
 
 dn: host=listera,ou=hosts,dc=torproject,dc=org
 host: listera
@@ -34,63 +42,83 @@
 
 dn: host=rouyi,ou=hosts,dc=torproject,dc=org
 host: rouyi
+rebootPolicy: justdoit
 
 dn: host=palmeri,ou=hosts,dc=torproject,dc=org
 host: palmeri
+rebootPolicy: justdoit
 
 dn: host=weissii,ou=hosts,dc=torproject,dc=org
 host: weissii
+rebootPolicy: manual
 
 dn: host=troodi,ou=hosts,dc=torproject,dc=org
 host: troodi
+rebootPolicy: justdoit
 
 dn: host=nevii,ou=hosts,dc=torproject,dc=org
 host: nevii
+rebootPolicy: justdoit
 
 dn: host=henryi,ou=hosts,dc=torproject,dc=org
 host: henryi
+rebootPolicy: justdoit
 
 dn: host=vineale,ou=hosts,dc=torproject,dc=org
 host: vineale
+rebootPolicy: justdoit
 
 dn: host=gayi,ou=hosts,dc=torproject,dc=org
 host: gayi
+rebootPolicy: justdoit
 
 dn: host=polyanthum,ou=hosts,dc=torproject,dc=org
 host: polyanthum
+rebootPolicy: justdoit
 
 dn: host=materculae,ou=hosts,dc=torproject,dc=org
 host: materculae
+rebootPolicy: justdoit
 
 dn: host=omeiense,ou=hosts,dc=torproject,dc=org
 host: omeiense
+rebootPolicy: rotation
 
 dn: host=meronense,ou=hosts,dc=torproject,dc=org
 host: meronense
+rebootPolicy: justdoit
 
 dn: host=colchicifolium,ou=hosts,dc=torproject,dc=org
 host: colchicifolium
+rebootPolicy: justdoit
 
 dn: host=carinatum,ou=hosts,dc=torproject,dc=org
 host: carinatum
+rebootPolicy: justdoit
 
 dn: host=build-x86-05,ou=hosts,dc=torproject,dc=org
 host: build-x86-05
+rebootPolicy: justdoit
 
 dn: host=build-x86-06,ou=hosts,dc=torproject,dc=org
 host: build-x86-06
+rebootPolicy: justdoit
 
 dn: host=perdulce,ou=hosts,dc=torproject,dc=org
 host: perdulce
+rebootPolicy: justdoit
 
 dn: host=staticiforme,ou=hosts,dc=torproject,dc=org
 host: staticiforme
+rebootPolicy: justdoit
 
 dn: host=woronowii,ou=hosts,dc=torproject,dc=org
 host: woronowii
+rebootPolicy: manual
 
 dn: host=winklerianum,ou=hosts,dc=torproject,dc=org
 host: winklerianum
+rebootPolicy: manual
 
 dn: host=orestis,ou=hosts,dc=torproject,dc=org
 host: orestis
@@ -106,21 +134,27 @@
 
 dn: host=kvm4,ou=hosts,dc=torproject,dc=org
 host: kvm4
+rebootPolicy: manual
 
 dn: host=oo-hetzner-03,ou=hosts,dc=torproject,dc=org
 host: oo-hetzner-03
+rebootPolicy: rotation
 
 dn: host=forrestii,ou=hosts,dc=torproject,dc=org
 host: forrestii
+rebootPolicy: justdoit
 
 dn: host=subnotabile,ou=hosts,dc=torproject,dc=org
 host: subnotabile
+rebootPolicy: justdoit
 
 dn: host=kvm5,ou=hosts,dc=torproject,dc=org
 host: kvm5
+rebootPolicy: manual
 
 dn: host=neriniflorum,ou=hosts,dc=torproject,dc=org
 host: neriniflorum
+rebootPolicy: rotation
 
 dn: host=hetzner-hel1-01,ou=hosts,dc=torproject,dc=org
 host: hetzner-hel1-01
@@ -132,12 +166,15 @@
 
 dn: host=build-x86-08,ou=hosts,dc=torproject,dc=org
 host: build-x86-08
+rebootPolicy: justdoit
 
 dn: host=web-hetzner-01,ou=hosts,dc=torproject,dc=org
 host: web-hetzner-01
+rebootPolicy: rotation
 
 dn: host=scw-arm-par-01,ou=hosts,dc=torproject,dc=org
 host: scw-arm-par-01
+rebootPolicy: manual
 
 dn: host=hetzner-hel1-02,ou=hosts,dc=torproject,dc=org
 host: hetzner-hel1-02
@@ -149,15 +186,19 @@
 
 dn: host=web-cymru-01,ou=hosts,dc=torproject,dc=org
 host: web-cymru-01
+rebootPolicy: rotation
 
 dn: host=crm-int-01,ou=hosts,dc=torproject,dc=org
 host: crm-int-01
+rebootPolicy: justdoit
 
 dn: host=crm-ext-01,ou=hosts,dc=torproject,dc=org
 host: crm-ext-01
+rebootPolicy: justdoit
 
 dn: host=build-x86-09,ou=hosts,dc=torproject,dc=org
 host: build-x86-09
+rebootPolicy: justdoit
 
 dn: host=bungei,ou=hosts,dc=torproject,dc=org
 host: bungei
@@ -181,9 +222,11 @@
 
 dn: host=fsn-node-01,ou=hosts,dc=torproject,dc=org
 host: fsn-node-01
+rebootPolicy: manual
 
 dn: host=fsn-node-02,ou=hosts,dc=torproject,dc=org
 host: fsn-node-02
+rebootPolicy: manual
 
 dn: host=loghost01.torproject.org,ou=hosts,dc=torproject,dc=org
 host: loghost01
@@ -243,6 +286,7 @@
 
 dn: host=fsn-node-03,ou=hosts,dc=torproject,dc=org
 host: fsn-node-03
+rebootPolicy: manual
 
 dn: host=onionoo-backend-02,ou=hosts,dc=torproject,dc=org
 host: onionoo-backend-02

The policy is being interpreted here as:

  • manual: requires manual intervention or special tools (fabric in case of ganeti, reboot-host in the case of KVM, nothing for windows boxes)
  • justdoit: can be rebooted with proper prior warning (10 minutes), possibly in parallel with each other
  • rotation: must not be rebooted together, longer warning (30 minutes)
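
To make that interpretation concrete, here is a rough sketch of how a scheduler could turn a host-to-policy mapping into reboot slots. It is not part of the reboot script: the function name, the constants and the return format are invented for the example, and staggering rotation hosts 30 minutes apart is an assumption modeled on what torproject-reboot-rotation does.

```python
from typing import Dict, List, Tuple

# delays in minutes, taken from the policy descriptions above
JUSTDOIT_DELAY = 10
ROTATION_DELAY = 30


def schedule_reboots(
    policies: Dict[str, str],
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Split hosts into (host, minutes-from-now) slots and a manual list."""
    slots, manual = [], []
    rotation_offset = 0
    for host in sorted(policies):
        policy = policies[host]
        if policy == "justdoit":
            # can go in parallel: everyone gets the same short warning
            slots.append((host, JUSTDOIT_DELAY))
        elif policy == "rotation":
            # never together: each host gets its own 30 minute slot
            rotation_offset += ROTATION_DELAY
            slots.append((host, rotation_offset))
        else:
            # "manual", or anything unknown, is left to the operator
            manual.append(host)
    return slots, manual
```

Treating unknown or missing policies as manual is the conservative choice: such a host is reported to the operator rather than rebooted by accident.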

I tried to update the "upgrades" docs to reflect this.

I think the last steps here are:

  1. add LDAP support in the reboot script
  2. parallelize "justdoit" jobs
  3. turn ganeti hosts into "rotation" once we officialize this new procedure

This is therefore likely to be completed in May.

comment:12 Changed 5 months ago by anarcat

Keywords: tpa-roadmap-june added; tpa-roadmap-may removed

i obviously did not have time to complete this in May, and i'm unlikely to do so in June either, but just in case, moving there.
