Opened 4 weeks ago

Closed 3 weeks ago

#31805 closed defect (fixed)

fsn-node-02 instability issues

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

fsn-node-02 seems to have problems staying up. it crashed once yesterday at ~13:00EDT and again today (twice) at 13:34 and 14:48.

I opened the following ticket with Hetzner:

we have had problems with this host during the week. it's the second time now that we
had to do a hard reset. network would first hang, then the controller would be reset
by the kernel, with a pattern like this:

Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
Sep 17 06:26:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
Sep 17 06:56:44 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 17 06:57:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:57:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:57:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
Sep 17 06:57:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
Sep 17 06:57:18 fsn-node-02/fsn-node-02/::ffff:88.198.8.87 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

This seems to happen more or less randomly. Eventually, the entire server becomes
unreachable and only a hard reset would restore it to a proper state. We only have
those logs because they are sent to an external server.

They annoyingly stripped out part of that request so I lost part of it. But basically I asked them to investigate this as a hard problem.
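
For comparing incidents, the hang and reset messages quoted above can be tallied from whatever log the remote syslog server keeps. This is only a sketch: the function name `summarize_e1000e` is made up for illustration, and it simply matches the two message strings shown in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper: summarise e1000e hang/reset messages read from stdin.
# It only matches the two kernel message strings quoted in this ticket.
summarize_e1000e() {
    awk '
        /Detected Hardware Unit Hang/ { hangs++ }
        /Reset adapter unexpectedly/  { resets++ }
        END { printf "hangs=%d resets=%d\n", hangs, resets }
    '
}

# Usage (the log path is an assumption, adjust to wherever the logs land):
#   summarize_e1000e < /var/log/kern.log
```

Fed the excerpt from the Hetzner request, this would report 13 hangs and 3 resets.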

Change History (7)

comment:1 Changed 4 weeks ago by anarcat

hetzner replied by recommending this workaround:

https://wiki.hetzner.de/index.php/Low_performance_with_Intel_i218/i219_NIC/en

I ran the following command on fsn2 at 14:49EDT today:

ethtool -K eth0 tso off gso off

i'm going to wait and see if it fixes the problem. if the machine crashes again, we'll know that didn't fix the problem.

if it doesn't crash again, then this should be added as a pre-up line in /etc/network/interfaces before the next reboot.
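
To check that the workaround is actually in effect, the offload state can be read back with `ethtool -k` (lowercase), which prints one `feature: on|off` line per feature. A minimal sketch, assuming the interface is eth0 as above; the helper name `offloads_disabled` is invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k` output on stdin and succeeds
# only if both TSO and GSO are reported as off.
offloads_disabled() {
    awk -F': ' '
        $1 == "tcp-segmentation-offload"     { tso = $2 }
        $1 == "generic-segmentation-offload" { gso = $2 }
        END { exit !(tso ~ /^off/ && gso ~ /^off/) }
    '
}

# Usage on the affected host (needs ethtool and an eth0 interface):
#   ethtool -k eth0 | offloads_disabled && echo "TSO/GSO are off"
```

Since the setting does not survive a reboot, the same check is worth running after the box comes back up.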

Last edited 4 weeks ago by anarcat

comment:2 Changed 4 weeks ago by weasel

auth.log suggests that might be 18:47Z.

comment:3 Changed 4 weeks ago by weasel

And so it seems the node has not yet crashed again.

comment:4 Changed 4 weeks ago by anarcat

seems much more stable now, indeed. i just rebooted the box which cleared the workaround, so we'll see if the bug returns, in which case we should do a cold reset again ("control-alt-delete" in hetzner's robot thing) and install the workaround permanently in /etc/network/interfaces.

comment:5 Changed 3 weeks ago by weasel

I wanted to migrate instances from node-02 to 01, and that kept failing.

Disabling segmentation offloading with ethtool -K eth0 tso off gso off made it work.

I added a stanza that runs this to /etc/network/interfaces.
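
For reference, such a stanza might look roughly like the fragment below. This is a sketch, not the configuration actually deployed: the `inet static` method and the `/sbin/ethtool` path are assumptions, and the existing address lines are left out on purpose.

```
# Sketch only: "inet static" and the ethtool path are assumptions.
iface eth0 inet static
    # ... existing address, netmask and gateway lines stay as they are ...
    pre-up /sbin/ethtool -K eth0 tso off gso off
```

With ifupdown, a `pre-up` command runs before the interface is brought up, so the offload settings are reapplied on every boot instead of being lost as in comment:4.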

comment:6 Changed 3 weeks ago by weasel

Also did that on fsn-node-01.

comment:7 Changed 3 weeks ago by anarcat

Resolution: fixed
Status: assigned → closed

awesome, so that workaround worked! i'll close this ticket and hetzner's.
