Opened 2 weeks ago

Last modified 20 hours ago

#34185 assigned defect

ganeti clusters don't like automatic upgrades

Reported by: anarcat
Owned by: hiro
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Major
Keywords: tpa-roadmap-may
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

this weekend the ganeti cluster had a partial outage: nodes were reachable, but networking was broken on all instances.

this is, presumably, because of the Debian buster point release that occurred on Saturday (!). last time this happened, weasel identified openvswitch as the culprit, and hiro deployed a fix that was meant to make it survive such situations. but either something else came up or the fix didn't work, because the problem happened again this weekend.

i fixed it by rebooting all nodes forcibly (without migrating first).

Change History (2)

comment:1 Changed 2 weeks ago by anarcat

This is the mail I sent on Sunday:

There was a ~8h ganeti outage until about now. It seems the buster point release broke things in our automated upgrade procedure. I didn't have time to diagnose the issue (I was running out) and figured it was more urgent to restore the service.

I rebooted all gnt-fsn nodes by hand (without migrating). Some instances returned with a state of "ERROR_down", so I manually started them (with gnt-instance start). Everything now seems to be back up.
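
The manual restart step above could be scripted roughly as follows. This is a minimal sketch, not what was actually run: it assumes `gnt-instance list --no-headers --separator=: -o name,status` prints one `name:status` pair per line, and the helper names (`instances_in_state`, `start_down_instances`) are made up for illustration.

```python
import subprocess


def instances_in_state(listing: str, state: str = "ERROR_down") -> list[str]:
    """Return instance names whose status column equals `state`.

    `listing` is expected to hold one "name:status" pair per line.
    """
    names = []
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so the status is always the final field.
        name, _, status = line.rpartition(":")
        if name and status == state:
            names.append(name)
    return names


def start_down_instances(dry_run: bool = True) -> list[str]:
    """Find instances in ERROR_down; start them only when dry_run is False."""
    listing = subprocess.run(
        ["gnt-instance", "list", "--no-headers", "--separator=:",
         "-o", "name,status"],
        check=True, capture_output=True, text=True,
    ).stdout
    down = instances_in_state(listing)
    for name in down:
        if dry_run:
            print("would run: gnt-instance start", name)
        else:
            subprocess.run(["gnt-instance", "start", name], check=True)
    return down
```

Calling `start_down_instances()` on the master node with the default `dry_run=True` only prints what it would start, which seems like the safer default during an outage.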

I haven't looked at Nagios in detail, but everything is mostly "yellow" now so I'll assume we're good.

It would be great if someone could look at the logs and see what happened. I suspect the openvswitch fix didn't work, or maybe there are other servers we need to block from needrestart's automation (or maybe even unattended-upgrades).

comment:2 Changed 20 hours ago by hiro

On the 10th of May there was an unattended upgrade. The kernel was updated and the system restarted.
Openvswitch was also updated and restarted, so maybe the blacklist didn't work.

According to the unattended-upgrades logs, the following services were handled by needrestart:

Restarting services...
 /etc/needrestart/restart.d/dbus.service
 systemctl restart apt-daily-upgrade.service ganeti.service smartmontools.service ssh.service strongswan.service syslog-ng.service systemd-logind.service unattended-upgrades.service unbound.service
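
To check which units needrestart restarted, the `systemctl restart` line in a log excerpt like the one above can be parsed. This is a minimal sketch: the function name `restarted_units` is hypothetical, and it assumes the log lines have the same shape as the excerpt.

```python
def restarted_units(log_text: str) -> list[str]:
    """Collect unit names from every 'systemctl restart ...' line in log_text."""
    units = []
    for line in log_text.splitlines():
        words = line.split()
        if words[:2] == ["systemctl", "restart"]:
            units.extend(words[2:])
    return units


# Excerpt copied from the unattended-upgrades log quoted above.
log = """Restarting services...
 /etc/needrestart/restart.d/dbus.service
 systemctl restart apt-daily-upgrade.service ganeti.service smartmontools.service ssh.service strongswan.service syslog-ng.service systemd-logind.service unattended-upgrades.service unbound.service
"""

units = restarted_units(log)
print("ganeti.service restarted:", "ganeti.service" in units)
print("openvswitch unit restarted:", any(u.startswith("openvswitch") for u in units))
```

This only covers what needrestart restarted through that `systemctl restart` call; a service restarted some other way during the upgrade would not show up in this list.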
