Opened 7 months ago

Closed 6 weeks ago

#32198 closed task (fixed)

upgrade CRM* machines to buster

Reported by: anarcat
Owned by: hiro
Priority: Medium
Milestone:
Component: Internal Services/Services Admin Team
Version:
Severity: Normal
Keywords: tpa-roadmap-february tpa-roadmap-march
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

those machines are downtime-sensitive enough to warrant a tracking ticket to ensure proper coordination among all teams.

we originally wanted to do this before november, but time is running out, so this is being pushed out to january.

in the meantime, we could consider migrating the machines to the FSN cluster to get filesystem-level snapshots, which would give us rollback capabilities. that said, we should do this as a near-zero-downtime migration.

Change History (14)

comment:1 Changed 6 months ago by anarcat

Owner: set to hiro
Status: new → assigned

let's make a checklist. one thing that needs to happen first is to move to a machine that supports filesystem snapshots for easier rollbacks, so this might need some action in december.

comment:2 Changed 6 months ago by hiro

Steps to complete in December:

[ ] Create a node in the FSN cluster and replicate the functionality of the crm VM
[ ] Have all the configuration in puppet
[ ] Copy current setup to the new node and test that everything is working
[ ] Start decommissioning old crm VM

Last edited 6 months ago by hiro

comment:3 Changed 4 months ago by anarcat

i haven't worked on snapshotting in the ganeti cluster yet, but Riseup has documentation on how to clone VMs that might be useful for us:

https://we.riseup.net/riseup+tech/ganeti#cloning-an-instance

comment:4 Changed 4 months ago by gaba

Keywords: tpa-roadmap-february added

comment:5 Changed 3 months ago by hiro

Keywords: tpa-roadmap-march added

comment:6 Changed 2 months ago by anarcat

I will soon copy those virtual machines to the new ganeti cluster. this will involve an IP address change which might affect the service.

Please let me know if there are any problems you can think of. In particular, do let me know if any internal (inside the server) or external (outside the server) service hardcodes the IP address of the virtual machine.
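
One way to look for hardcoded addresses is to scan configuration files for the old IPs. The following is only a minimal sketch: the old addresses are the ones that appear in the diffs later in this ticket, while the function name `find_hardcoded` and the choice of directory to scan are assumptions made for illustration. It does plain substring matching, so it will not catch addresses written in a different notation (for example a compressed or expanded IPv6 form).

```python
import os

# Old addresses of crm-int-01 and crm-ext-01, as shown in the diffs in this ticket.
OLD_ADDRESSES = [
    "138.201.212.235",
    "138.201.212.236",
    "2a01:4f8:172:39ca:0:dad3:11:1",
    "2a01:4f8:172:39ca:0:dad3:12:1",
]


def find_hardcoded(root, needles):
    """Yield (path, line_number, line) for each line under root containing a needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only read regular files; skip sockets, FIFOs and broken symlinks.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(needle in line for needle in needles):
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable files (for example, permission denied) are skipped.
                continue
```

Calling `list(find_hardcoded("/etc", OLD_ADDRESSES))` on each machine would then list candidate files to review by hand; `/etc` is just an example starting point.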

After the copy, you will be able to test the new virtual machines by changing your /etc/hosts. Once that's done, we will look at doing the buster upgrade and, if all goes well, we'll migrate the machines, with a scheduled downtime.

When should we perform the actual upgrade?

comment:7 Changed 2 months ago by anarcat

crm-ext-01 IP address changed in the new server (but not external DNS):

--- /mnt/etc/network/interfaces.bak	2019-07-29 19:57:38.233272523 +0000
+++ /mnt/etc/network/interfaces	2020-03-25 21:45:37.468998111 +0000
@@ -1,12 +1,16 @@
+# This file describes the network interfaces available on your system
+# and how to activate them. For more information, see interfaces(5).
+
+# The loopback network interface
 auto lo
 iface lo inet loopback
 
-allow-hotplug eth0
+# The primary network interface
+auto eth0
 iface eth0 inet static
-    address 138.201.212.236/28
-    gateway 138.201.212.225
+    address 116.202.120.190/27
+    gateway 116.202.120.161
 iface eth0 inet6 static
     accept_ra 0
-    address 2a01:4f8:172:39ca:0:dad3:12:1/96
-    gateway 2a01:4f8:172:39ca:0:dad3:0:1
-source /etc/network/interfaces.d/*
+    address 2a01:4f8:fff0:4f:266:37ff:fe9d:2d74/64
+    gateway 2a01:4f8:fff0:4f::1

use this to add the new IP to your local /etc/hosts (it overrides DNS on your machine only): printf "116.202.120.190 crm-ext-01.torproject.org\n2a01:4f8:fff0:4f:266:37ff:fe9d:2d74 crm-ext-01.torproject.org\n" >> /etc/hosts

crm-int-01:

--- /mnt/etc/network/interfaces.bak	2019-07-29 19:45:22.839197179 +0000
+++ /mnt/etc/network/interfaces	2020-03-25 21:45:57.820829459 +0000
@@ -1,13 +1,16 @@
+# This file describes the network interfaces available on your system
+# and how to activate them. For more information, see interfaces(5).
+
+# The loopback network interface
 auto lo
 iface lo inet loopback
 
-allow-hotplug eth0
+# The primary network interface
+auto eth0
 iface eth0 inet static
-    address 138.201.212.235/28
-    gateway 138.201.212.225
+    address 116.202.120.186/27
+    gateway 116.202.120.161
 iface eth0 inet6 static
     accept_ra 0
-    address 2a01:4f8:172:39ca:0:dad3:11:1/96
-    gateway 2a01:4f8:172:39ca:0:dad3:0:1
-
-source /etc/network/interfaces.d/*
+    address 2a01:4f8:fff0:4f:266:37ff:fed7:6ae2/64
+    gateway 2a01:4f8:fff0:4f::1

use this to add the new IP to your local /etc/hosts (it overrides DNS on your machine only): printf "116.202.120.186 crm-int-01.torproject.org\n2a01:4f8:fff0:4f:266:37ff:fed7:6ae2 crm-int-01.torproject.org\n" >> /etc/hosts
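
Running the printf commands above twice appends duplicate lines. As a hedged alternative, here is a small sketch that builds the same entries and appends only the ones that are missing. The helper names `hosts_lines` and `append_missing` are made up for this example, and the addresses are the ones from this comment.

```python
def hosts_lines(hostname, addresses):
    """Return one hosts-file line per address, each mapping to hostname."""
    return [f"{address} {hostname}" for address in addresses]


def append_missing(path, lines):
    """Append the lines not already present in the file and return them."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            content = handle.read()
    except FileNotFoundError:
        content = ""
    existing = {line.strip() for line in content.splitlines()}
    missing = [line for line in lines if line not in existing]
    if missing:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if content == "" or content.endswith("\n") else "\n"
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(prefix + "".join(line + "\n" for line in missing))
    return missing


# Entries for crm-int-01, using the addresses given in this comment.
CRM_INT_LINES = hosts_lines(
    "crm-int-01.torproject.org",
    ["116.202.120.186", "2a01:4f8:fff0:4f:266:37ff:fed7:6ae2"],
)
```

`append_missing("/etc/hosts", CRM_INT_LINES)` would need root, like the printf version; trying it on a scratch copy of the file first is safer.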

the test machines are currently shut down but can be started by TPA staff at any time with, for example:

gnt-instance start crm-int-01.torproject.org

i am keeping them offline because i'm not sure what will happen if they run concurrently with the real instances.

comment:8 Changed 2 months ago by hiro

Anarcat, I am waiting for GR to get back to me about when they are available to test the switch, and then I'll take over.

Thanks for this.

comment:9 Changed 2 months ago by hiro

GR is suggesting to start testing in the week starting April 13 (roughly 3 weeks from now), then plan for the migration on Tuesday, April 21.

comment:10 Changed 7 weeks ago by hiro

This is a possible timeline for migrating the CRM machine (to be confirmed with GR)

  1. libvirt-import, renumber-instances, DNS TTL, DRBD change, implies a few minutes of downtime, anarcat, wed 8th
  2. hack at ipsec, anarcat, wed 8th
  3. test, anarcat, wed 8th
  4. buster upgrade, hiro, thu 9th
  5. test, hiro, thu 9th
  6. put Drupal in maintenance mode on old CRMs, GR, wed 15th
  7. sync mysql and uploads, GR, wed 15th
  8. shutdown old CRM, hiro/anarcat, wed 15th (step 8 in the ganeti procedure)
  9. final DNS change, hiro/anarcat, wed 15th (step 11 and 12 in the ganeti procedure)
Last edited 7 weeks ago by hiro

comment:11 Changed 7 weeks ago by anarcat

the instances were re-synchronized to the ganeti cluster. here are the IP address changes:

crm-ext-01:

--- /mnt/etc/network/interfaces.bak	2020-04-08 14:16:21.742548944 +0000
+++ /mnt/etc/network/interfaces	2020-04-08 14:16:23.034539191 +0000
@@ -1,12 +1,18 @@
+# This file describes the network interfaces available on your system
+# and how to activate them. For more information, see interfaces(5).
+
+source /etc/network/interfaces.d/*
+
+# The loopback network interface
 auto lo
 iface lo inet loopback
 
-allow-hotplug eth0
+# The primary network interface
+auto eth0
 iface eth0 inet static
-    address 138.201.212.236/28
-    gateway 138.201.212.225
+    address 116.202.120.190/27
+    gateway 116.202.120.161
+
+# IPv6 configuration
 iface eth0 inet6 static
     accept_ra 0
-    address 2a01:4f8:172:39ca:0:dad3:12:1/96
-    gateway 2a01:4f8:172:39ca:0:dad3:0:1
-source /etc/network/interfaces.d/*
+    address 2a01:4f8:fff0:4f:266:37ff:fe63:6385/64
+    gateway 2a01:4f8:fff0:4f::1
copying /mnt/etc/hosts to /mnt/etc/hosts.bak on fsn-node-05.torproject.org
rewriting host file /mnt/etc/hosts on <Connection host=fsn-node-05.torproject.org user=root>
--- /mnt/etc/hosts.bak	2020-04-08 14:16:26.342514217 +0000
+++ /mnt/etc/hosts	2020-04-08 14:16:28.794495703 +0000
@@ -3,7 +3,7 @@
 ##
 
 127.0.0.1       localhost
-138.201.212.236        crm-ext-01.torproject.org crm-ext-01
+116.202.120.190 crm-ext-01.torproject.org crm-ext-01
 
 # The following lines are desirable for IPv6 capable hosts
 ::1     localhost ip6-localhost ip6-loopback
@@ -12,3 +12,4 @@
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters
 ff02::3 ip6-allhosts
+2a01:4f8:fff0:4f:266:37ff:fe63:6385 crm-ext-01.torproject.org crm-ext-01

crm-int-01:

--- /mnt/etc/network/interfaces.bak	2020-04-08 14:27:42.893395619 +0000
+++ /mnt/etc/network/interfaces	2020-04-08 14:27:44.265385216 +0000
@@ -1,13 +1,20 @@
+# This file describes the network interfaces available on your system
+# and how to activate them. For more information, see interfaces(5).
+
+source /etc/network/interfaces.d/*
+
+# The loopback network interface
 auto lo
 iface lo inet loopback
 
-allow-hotplug eth0
+# The primary network interface
+auto eth0
 iface eth0 inet static
-    address 138.201.212.235/28
-    gateway 138.201.212.225
+    address 116.202.120.186/27
+    gateway 116.202.120.161
+
+# IPv6 configuration
 iface eth0 inet6 static
     accept_ra 0
-    address 2a01:4f8:172:39ca:0:dad3:11:1/96
-    gateway 2a01:4f8:172:39ca:0:dad3:0:1
-
-source /etc/network/interfaces.d/*
+    address 2a01:4f8:fff0:4f:266:37ff:fe4d:f883/64
+    gateway 2a01:4f8:fff0:4f::1
copying /mnt/etc/hosts to /mnt/etc/hosts.bak on fsn-node-05.torproject.org
rewriting host file /mnt/etc/hosts on <Connection host=fsn-node-05.torproject.org user=root>
--- /mnt/etc/hosts.bak	2020-04-08 14:27:47.821358271 +0000
+++ /mnt/etc/hosts	2020-04-08 14:27:50.201340236 +0000
@@ -3,7 +3,7 @@
 ##
 
 127.0.0.1       localhost
-138.201.212.235        crm-int-01.torproject.org crm-int-01
+116.202.120.186 crm-int-01.torproject.org crm-int-01
 
 # The following lines are desirable for IPv6 capable hosts
 ::1     localhost ip6-localhost ip6-loopback
@@ -12,3 +12,4 @@
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters
 ff02::3 ip6-allhosts
+2a01:4f8:fff0:4f:266:37ff:fe4d:f883 crm-int-01.torproject.org crm-int-01

the ipsec tunnel was renumbered and the machines were rebooted (they were not finding each other).

the machines are ready to be tested, we're at step 3.
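
Before testing, the renumbered addresses from the diffs above can be sanity-checked with Python's standard `ipaddress` module. This is only a sketch: it checks that each gateway falls inside the subnet implied by the interface address and prefix length, nothing more. The function name `gateway_on_link` is made up for this example.

```python
import ipaddress


def gateway_on_link(address_with_prefix, gateway):
    """Return True if gateway lies inside the subnet of address/prefix."""
    interface = ipaddress.ip_interface(address_with_prefix)
    return ipaddress.ip_address(gateway) in interface.network


# (address/prefix, gateway) pairs copied from the new interfaces files above.
NEW_CONFIG = [
    ("116.202.120.190/27", "116.202.120.161"),                           # crm-ext-01 IPv4
    ("2a01:4f8:fff0:4f:266:37ff:fe63:6385/64", "2a01:4f8:fff0:4f::1"),  # crm-ext-01 IPv6
    ("116.202.120.186/27", "116.202.120.161"),                           # crm-int-01 IPv4
    ("2a01:4f8:fff0:4f:266:37ff:fe4d:f883/64", "2a01:4f8:fff0:4f::1"),  # crm-int-01 IPv6
]

for address, gateway in NEW_CONFIG:
    print(address, gateway, gateway_on_link(address, gateway))
```

A gateway outside the subnet would usually mean a typo in the address, the prefix length or the gateway, so this catches some renumbering mistakes before the machine is booted.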

comment:12 Changed 6 weeks ago by anarcat

i just lowered TTLs to 5 minutes. hiro did the buster upgrade and things seem to work, according to us and GR. there was an ipsec problem that was fixed yesterday: puppet kept running even though it couldn't talk to the puppetmaster, which reset and broke the ipsec config. it was fixed by disabling puppet and redoing the config.

we're getting to step 6, in about 1h10m.

comment:13 Changed 6 weeks ago by anarcat

GR did steps 6 and 7

hiro did steps 8 and 9: switched DNS in LDAP and nagios, reran ud-replicate and puppet on the puppetmaster, and grepped for the old IP in dns (no match)

i redid part of 1: i had forgotten the DRBD switch.

then we stumbled upon new problems:

  • puppet couldn't run because of post-buster manifest errors (hiro fixed this)
  • redis was listening on the wrong port (anarcat fixed this by restoring the old config from backups)

The redis config was lost somewhere in the process, which is somewhat worrisome. Hopefully it's the only thing that was lost.
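
A quick post-migration check for this kind of problem is to test whether each service accepts TCP connections on the port its clients expect. The sketch below makes assumptions: the ticket does not say which port redis was supposed to use, so 6379 (redis's default) in the usage note is a guess, and `is_listening` is a name made up for this example.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

For example, `is_listening("127.0.0.1", 6379)` run on the CRM host would show whether something accepts connections on the default redis port. It does not confirm that the listener is actually redis.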

We're at "post step 9" which is undocumented, but it's basically:

  1. test (GR)
  2. retire old machine (TPA, step 12-13 in the ganeti procedure)

comment:14 Changed 6 weeks ago by anarcat

Resolution: fixed
Status: assigned → closed

the php-fpm configurations were incorrect, but i reset them in puppet.

i scheduled deletion of the crm* VMs on macrum, even though the host itself will be retired as part of #33082

we are all done here, thanks to everyone for your help!
