Opened 5 months ago

Closed 5 months ago

#33958 closed defect (implemented)

fsn VMs lost connectivity this morning

Reported by: weasel
Owned by: hiro
Priority: High
Component: Internal Services/Tor Sysadmin Team
Severity: Major

Description

This morning several of our VMs at fsn were without network.

The instances were still running, and gnt-console still gave me a console that I could log into, but the machines were not reachable from the network, nor could they reach the network. Running tcpdump on the bridge interface on the node did not show any network traffic for the instance.

Migrating them brought them back online (tried with vineale, for instance). Rebooting also helped (tried with everything else).

The running openvswitch config on a node whose instances had no network looked like this:

root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "br0"
            Interface "br0"
                type: internal
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
    ovs_version: "2.10.1"

When it's working, it should look more like this:

root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port "tap3"
            tag: 4000
            trunks: [4000]
            Interface "tap3"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "tap4"
            tag: 4000
            trunks: [4000]
            Interface "tap4"
        Port "br0"
            Interface "br0"
                type: internal
        Port "tap5"
            tag: 4000
            trunks: [4000]
            Interface "tap5"
        Port "tap1"
            tag: 4000
            trunks: [4000]
            Interface "tap1"
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
        Port "tap2"
            tag: 4000
            trunks: [4000]
            Interface "tap2"
        Port "tap0"
            tag: 4000
            trunks: [4000]
            Interface "tap0"
    ovs_version: "2.10.1"
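
The difference between the two listings can be checked mechanically. The following is a minimal sketch, not something that was run as part of this ticket. It assumes the instance interfaces are named tap* on the node and that the bridge is br0, as in the output above; the function name missing_taps is made up for illustration.

```shell
#!/bin/sh
# Sketch: report tap interfaces that exist on the node but are not
# attached to the bridge. Interface and bridge names follow the output
# above; the function name is hypothetical.

# missing_taps HOST_IFACES BRIDGE_PORTS
# Both arguments are newline-separated lists of interface names.
# Prints every tap* name from the first list that is absent from the second.
missing_taps() {
    printf '%s\n' "$1" | grep '^tap' | while IFS= read -r tap; do
        printf '%s\n' "$2" | grep -qxF -- "$tap" || printf '%s\n' "$tap"
    done
}

# Usage on a node (assumes the bridge is called br0):
#   missing_taps "$(ls /sys/class/net)" "$(ovs-vsctl list-ports br0)"
```

Any name printed is a tap that the kernel knows about but the bridge does not, which matches the broken state shown first. Re-attaching such a port by hand with something like "ovs-vsctl add-port br0 tap3 tag=4000 trunks=4000" should in principle reproduce the working listing, but that was not tried here; only migration and reboot are confirmed to help.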

My first guess was that migrating somehow had screwed up the network config, but that's probably not what happened, as the issue happened again shortly afterwards when I was running upgrades. So:

My current working theory is that the following happened:

  • In the morning, once automatically and once manually, we ran package upgrades.
  • Today this included an openssl update, and openvswitch is linked against openssl.
  • needrestart restarted openvswitch.
  • Restarting openvswitch does not restore the dynamically added VM taps into the bridge.
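
The linkage step of this theory is easy to confirm on a node. A small sketch, assuming the daemon binary is ovs-vswitchd and is on the PATH; the helper name mentions_openssl is hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether `ldd` output mentions the OpenSSL libraries.
# Reads ldd output on stdin and succeeds if libssl or libcrypto appears.
# The function name is hypothetical.
mentions_openssl() {
    grep -Eq 'lib(ssl|crypto)\.so'
}

# Usage on a node (binary name is an assumption):
#   ldd "$(command -v ovs-vswitchd)" | mentions_openssl \
#       && echo "openvswitch is linked against openssl"
```

If the check succeeds, an openssl upgrade would make needrestart consider the openvswitch daemons outdated, which fits the sequence above.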

I propose we blacklist openvswitch from being restarted by needrestart.
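
One possible way to do this is a needrestart drop-in that sets the per-service restart default to "no" through the override_rc setting. This is a sketch only: the ticket does not record how the change was actually made, and the unit name patterns, the file name openvswitch.conf and the helper name write_ovs_override are assumptions.

```shell
#!/bin/sh
# Sketch: write a needrestart drop-in so that Open vSwitch is not
# restarted automatically. The unit name patterns and the file name are
# assumptions; compare with what `systemctl list-units` shows on the node.

# write_ovs_override [DIR]
# DIR defaults to needrestart's drop-in directory.
write_ovs_override() {
    dir=${1:-/etc/needrestart/conf.d}
    # Quoted delimiter: the Perl snippet must reach the file unexpanded.
    cat > "$dir/openvswitch.conf" <<'EOF'
# Do not restart Open vSwitch automatically: a restart does not restore
# the dynamically added VM taps into the bridge.
$nrconf{override_rc}{qr(^openvswitch)} = 0;
$nrconf{override_rc}{qr(^ovs-)} = 0;
EOF
}

# Usage (as root on a node):
#   write_ovs_override
```

With an override of 0, needrestart should still report the service as needing a restart but leave the restart itself to the admin, who can then schedule it together with instance migrations.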

Change History (3)

comment:1 Changed 5 months ago by anarcat

Owner: changed from tpa to anarcat
Status: changed from new to accepted

> I propose we blacklist openvswitch from being restarted by needrestart.

Agreed. hiro, want to look into that?

comment:2 Changed 5 months ago by hiro

Owner: changed from anarcat to hiro
Status: changed from accepted to assigned

comment:3 Changed 5 months ago by hiro

Resolution: implemented
Status: changed from assigned to closed

I have blacklisted openvswitch in needrestart. Closing for now.
