Opened 5 months ago

Last modified 4 months ago

#34185 assigned defect

ganeti clusters don't like automatic upgrades

Reported by: anarcat
Owned by: hiro
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Major
Keywords: tpa-roadmap-june
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

this weekend the ganeti cluster had a partial outage: nodes were reachable, but networking was broken on all instances.

this is, presumably, because of the Debian buster point release that occurred on saturday (!). last time this happened, weasel identified openvswitch as the culprit, and hiro deployed a fix that was supposed to make it survive such situations. but either something else came up or the fix didn't work, because the problem happened again this weekend.

i fixed it by rebooting all nodes forcibly (without migrating first).

Child Tickets

Change History (7)

comment:1 Changed 5 months ago by anarcat

This is the mail I sent on sunday:

There was a ~8h ganeti outage until about now. It seems the buster point release broke things in our automated upgrade procedure. I didn't have time to diagnose the issue (I was running out) and figured it was more urgent to restore the service.

I rebooted all gnt-fsn nodes by hand (without migrating). Some instances returned with a state of "ERROR_down", so I manually started them (with gnt-instance start). Everything now seems to be back up.
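
For reference, a rough sketch of how such instances can be found and started from the cluster master (the instance name is a placeholder):

 # list instances and their state, then start the ones stuck in ERROR_down
 gnt-instance list -o name,status | grep ERROR_down
 gnt-instance start <instance>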

I haven't looked at Nagios in detail, but everything is mostly "yellow" now so I'll assume we're good.

It would be great if someone could look at the logs and see what happened. I suspect the openvswitch fix didn't work, or maybe there are other services we need to block from needrestart's automation (or maybe even unattended-upgrades).
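
One possible shape of the needrestart exclusion mentioned above (a sketch only; the conf.d file name and service patterns are illustrative, assuming the stock Debian needrestart configuration format):

 # /etc/needrestart/conf.d/50-tpa.conf -- Perl syntax, read by needrestart
 # a value of 0 means: never restart the matching service automatically
 $nrconf{override_rc}{qr(^openvswitch-switch)} = 0;
 $nrconf{override_rc}{qr(^ganeti)} = 0;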

comment:2 Changed 5 months ago by hiro

On the 10th of May there was an unattended upgrade: the kernel was updated and the system restarted.
Openvswitch was also updated and restarted, so maybe the blacklist didn't work.

According to the unattended-upgrades logs, the following services were restarted by needrestart:

Restarting services...
 /etc/needrestart/restart.d/dbus.service
 systemctl restart apt-daily-upgrade.service ganeti.service smartmontools.service ssh.service strongswan.service syslog-ng.service systemd-logind.service unattended-upgrades.service unbound.service

comment:3 Changed 5 months ago by hiro

Openvswitch was updated together with the following group of packages:

2020-05-10 06:12:53,754 INFO Packages that will be upgraded: base-files distro-info-data iputils-arping iputils-ping iputils-tracepath libbrlapi0.6 libfuse2 libpam-systemd libsystemd0 libudev1 linux-compiler-gcc-8-x86 linux-headers-amd64 linux-image-amd64 linux-kbuild-4.19 openvswitch-common openvswitch-switch postfix postfix-cdb rake rubygems-integration systemd systemd-sysv tzdata udev

Checking the openvswitch status, it has not been restarted since the 10th of May:

   Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sun 2020-05-10 14:05:11 UTC; 2 weeks 3 days ago
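
For the record, a check along these lines on each node shows when ovs-vswitchd last restarted (assuming the unit is named openvswitch-switch, as on buster):

 systemctl status openvswitch-switch.service
 journalctl -u openvswitch-switch.service --since "2020-05-10"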

And in the log from that day I can see it actually died twice:

2020-05-10T06:13:16.534Z|00003|vlog(monitor)|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:16.534Z|00004|daemon_unix(monitor)|INFO|pid 3211 died, exit status 0, exiting
2020-05-10T06:13:16.787Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:16.788Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T06:13:16.788Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T06:13:16.788Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T06:13:16.788Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T06:13:16.791Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
2020-05-10T06:13:17.332Z|00002|daemon_unix(monitor)|INFO|pid 29781 died, exit status 0, exiting
2020-05-10T06:13:17.621Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T06:13:17.623Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T06:13:17.623Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T06:13:17.623Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T06:13:17.623Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T06:13:17.630Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1

...

2020-05-10T14:02:23.078Z|00036|bridge|INFO|bridge br0: using datapath ID 00007eb83553f345
2020-05-10T14:02:23.398Z|00037|bridge|INFO|bridge br0: deleted interface br0 on port 65534
2020-05-10T14:02:23.578Z|00002|daemon_unix(monitor)|INFO|pid 29951 died, exit status 0, exiting
2020-05-10T14:05:05.241Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-05-10T14:05:05.247Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2020-05-10T14:05:05.247Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2020-05-10T14:05:05.247Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-05-10T14:05:05.247Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-05-10T14:05:05.250Z|00006|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.10.1
2020-05-10T14:05:09.384Z|00007|ofproto_dpif|INFO|system@ovs-system: Datapath supports recirculation
2020-05-10T14:05:09.384Z|00008|ofproto_dpif|INFO|system@ovs-system: VLAN header stack length probed as 2
2020-05-10T14:05:09.384Z|00009|ofproto_dpif|INFO|system@ovs-system: MPLS label stack length probed as 1
2020-05-10T14:05:09.384Z|00010|ofproto_dpif|INFO|system@ovs-system: Datapath supports truncate action
2020-05-10T14:05:09.384Z|00011|ofproto_dpif|INFO|system@ovs-system: Datapath supports unique flow ids

Last edited 5 months ago by hiro

comment:4 Changed 5 months ago by hiro

Tested reinstalling openvswitch on fsn-node-06 with:

apt install --reinstall openvswitch-switch

It caused openvswitch-switch to restart:

Active: active (exited) since Wed 2020-05-27 17:09:27 UTC; 2min 44s ago

I think openvswitch should be upgraded manually for the time being.
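
A sketch of how it could be excluded from automatic upgrades until this is resolved (both options are suggestions, not something already deployed):

 # option 1: hold the packages so apt won't upgrade them at all
 # (remember to unhold before doing the manual upgrade)
 apt-mark hold openvswitch-switch openvswitch-common

 # option 2: keep only unattended-upgrades away from them, e.g. in
 # /etc/apt/apt.conf.d/50unattended-upgrades:
 Unattended-Upgrade::Package-Blacklist {
     "openvswitch-";
 };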

comment:5 Changed 5 months ago by hiro

Migrating VMs between nodes brings them back online.
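
For reference, the migration can be done per instance from the cluster master, roughly like this (the instance name is a placeholder):

 gnt-instance migrate <instance>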

comment:6 Changed 5 months ago by hiro

Opened a bug against the package yesterday: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746

Last edited 5 months ago by hiro

comment:7 Changed 4 months ago by hiro

Keywords: tpa-roadmap-june added; tpa-roadmap-may removed