Opened 6 months ago

Closed 5 months ago

#33098 closed defect (fixed)

fsn-node-03 disk problems

Reported by: anarcat
Owned by: anarcat
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Blocker
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

For some reason, the HDD on fsn-node-03 is reporting SMART errors. I originally filed this ticket with Hetzner:

yesterday, we got errors from the SMART daemon on this host, looking like this:

From: root <root@…>
Subject: SMART error (ErrorCount) detected on host: fsn-node-03
To: root@…
Date: Tue, 28 Jan 2020 23:35:35 +0000

This message was generated by the smartd daemon running on:

host name: fsn-node-03
DNS domain: torproject.org

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], ATA error count increased from 0 to 1

Device info:
TOSHIBA MG06ACA10TEY, S/N:..., WWN:...., FW:0103, 10.0 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

Another such email triggered an hour later as well.
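smartd only tells us that the error count went up; pulling the actual numbers out of the alert body is a one-liner. A small sketch using the message text quoted above (the `sed` extraction is a hypothetical helper, not part of our monitoring):

```shell
# Extract the old and new ATA error counts from a smartd alert line
# like the one quoted above.
line='Device: /dev/sda [SAT], ATA error count increased from 0 to 1'
old=$(printf '%s\n' "$line" | sed -n 's/.*increased from \([0-9]*\) to [0-9]*.*/\1/p')
new=$(printf '%s\n' "$line" | sed -n 's/.*increased from [0-9]* to \([0-9]*\).*/\1/p')
echo "error count went from $old to $new"
```

On the host itself, `smartctl -l error /dev/sda` shows the full error log behind those counts.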

The RAID array the disk is on triggered a rebuild as well, somehow. The following
messages showed up in dmesg:

[Jan28 20:44] md: resync of RAID array md2
[Jan28 22:20] ata2.00: exception Emask 0x50 SAct 0x4000 SErr 0x480900 action 0x6 frozen
[  +0.004419] ata2.00: irq_stat 0x08000000, interface fatal error
[  +0.001489] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
[  +0.000781] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000785] ata2.00: cmd 61/00:70:80:52:f6/05:00:ec:00:00/40 tag 14 ncq dma 655360 out
                       res 40/00:70:80:52:f6/00:00:ec:00:00/40 Emask 0x50 (ATA bus error)
[  +0.001600] ata2.00: status: { DRDY }
[  +0.000801] ata2: hard resetting link
[  +0.310126] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.088155] ata2.00: configured for UDMA/133
[  +0.000031] ata2: EH complete
[Jan28 23:27] ata1.00: exception Emask 0x50 SAct 0x1c00 SErr 0x280900 action 0x6 frozen
[  +0.004338] ata1.00: irq_stat 0x08000000, interface fatal error
[  +0.001815] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
[  +0.000772] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000738] ata1.00: cmd 60/00:50:00:3b:b1/05:00:47:01:00/40 tag 10 ncq dma 655360 in
                       res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001512] ata1.00: status: { DRDY }
[  +0.000793] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000727] ata1.00: cmd 60/00:58:00:40:b1/05:00:47:01:00/40 tag 11 ncq dma 655360 in
                       res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001534] ata1.00: status: { DRDY }
[  +0.000769] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000720] ata1.00: cmd 60/00:60:00:45:b1/01:00:47:01:00/40 tag 12 ncq dma 131072 in
                       res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001453] ata1.00: status: { DRDY }
[  +0.000778] ata1: hard resetting link
[  +0.556198] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.001780] ata1.00: configured for UDMA/133
[  +0.000037] ata1: EH complete
[Jan28 23:32] perf: interrupt took too long (2518 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x480900 action 0x6 frozen
[  +0.004173] ata2.00: irq_stat 0x08000000, interface fatal error
[  +0.001996] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
[  +0.000737] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000729] ata2.00: cmd 61/00:d0:00:62:0e/05:00:86:01:00/40 tag 26 ncq dma 655360 out
                       res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001486] ata2.00: status: { DRDY }
[  +0.000854] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000718] ata2.00: cmd 61/00:d8:00:67:0e/05:00:86:01:00/40 tag 27 ncq dma 655360 out
                       res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001478] ata2.00: status: { DRDY }
[  +0.000884] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000736] ata2.00: cmd 61/00:e0:00:6c:0e/01:00:86:01:00/40 tag 28 ncq dma 131072 out
                       res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001453] ata2.00: status: { DRDY }
[  +0.000760] ata2: hard resetting link
[  +0.000011] ata1.00: exception Emask 0x50 SAct 0x10000000 SErr 0x280900 action 0x6 frozen
[  +0.000764] ata1.00: irq_stat 0x08000000, interface fatal error
[  +0.000725] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
[  +0.000712] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000700] ata1.00: cmd 60/80:e0:00:6d:0e/04:00:86:01:00/40 tag 28 ncq dma 589824 in
                       res 40/00:e0:00:6d:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001426] ata1.0...

I lost the original message, as Hetzner trims replies; it also included the drive's smartctl -x output, which is therefore lost as well.

40 minutes later, the drive was replaced and the machine booted again.

We had trouble with the /dev/md2 array: for some reason it wouldn't autostart after the intervention. I started it by hand, rebuilt the initrd and rebooted, to no avail.

I tried to repartition the new sda drive they added, then added it to the array, which started syncing.
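That repartition-and-re-add dance looks roughly like this. A dry-run sketch: the `run` wrapper only prints each command, so it is safe anywhere; on the real host you would execute the commands directly (the piped `sfdisk` idiom assumes GPT/MBR cloning is appropriate for the layout):

```shell
# Dry-run sketch of cloning the surviving disk's partition table onto
# the replacement drive and re-adding it to the degraded mirror.
# run() only prints each command instead of executing it.
run() { echo "+ $*"; }

# copy sdb's partition table onto the new sda
# (for real: sfdisk -d /dev/sdb | sfdisk /dev/sda)
run 'sfdisk -d /dev/sdb | sfdisk /dev/sda'
# re-add the new partition to the array; this kicks off a resync
run mdadm /dev/md2 --add /dev/sda1
# watch the rebuild progress
run cat /proc/mdstat
```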

But after a while, the error came back:

[Jan29 18:30] ata1.00: exception Emask 0x50 SAct 0x80080 SErr 0x480900 action 0x6 frozen
 [  +0.000020] ata1.00: irq_stat 0x08000000, interface fatal error
 [  +0.000010] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
 [  +0.000012] ata1.00: failed command: READ FPDMA QUEUED
 [  +0.000018] ata1.00: cmd 60/20:38:00:98:04/00:00:00:00:00/40 tag 7 ncq dma 16384 in
                        res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
 [  +0.000021] ata1.00: status: { DRDY }
 [  +0.000010] ata1.00: failed command: WRITE FPDMA QUEUED
 [  +0.000015] ata1.00: cmd 61/00:98:00:e2:ff/05:00:0e:01:00/40 tag 19 ncq dma 655360 out
                        res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
 [  +0.000012] ata1.00: status: { DRDY }
 [  +0.000009] ata1: hard resetting link
 [  +0.311884] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 [  +0.049673] ata1.00: configured for UDMA/133
 [  +0.000023] ata1: EH complete

and smartd sent us another email about:

Device: /dev/sda [SAT], ATA error count increased from 0 to 1

i reopened the ticket with Hetzner, who will do another visit to the server shortly. they also found it strange that the error came back, and suspect something might be wrong with the SATA cables.

Child Tickets

Change History (9)

comment:1 Changed 6 months ago by anarcat

they tested the drive (probably a SMART short self-test) and didn't find anything. they swapped the cable and gave us back the box.

after boot, the raid array *again* did not come up. i restarted it and re-added the extra drive. during the sync, we got an ATA error again, but this time it seems the drive didn't log it, because smartd didn't send us an email. i only noticed it through a dmesg tail:

[Jan29 20:00] microcode: microcode updated early to revision 0xca, date = 2019-10-03
[  +0.000000] Linux version 4.19.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
[...]
[Jan29 20:18] md/raid1:md2: active with 1 out of 2 mirrors
[  +0.050102] md2: detected capacity change from 0 to 10000693985280
[Jan29 20:19] md: recovery of RAID array md2
[...]
[Jan29 22:33] ata1.00: exception Emask 0x50 SAct 0xe000 SErr 0x480900 action 0x6 frozen
[  +0.000014] ata1.00: irq_stat 0x08000000, interface fatal error
[  +0.000007] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
[  +0.000009] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000013] ata1.00: cmd 61/00:68:80:0c:c4/08:00:b8:00:00/40 tag 13 ncq dma 1048576 out
                       res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
[  +0.000011] ata1.00: status: { DRDY }
[  +0.000005] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000012] ata1.00: cmd 61/00:70:80:14:c4/03:00:b8:00:00/40 tag 14 ncq dma 393216 out
                       res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
[  +0.000011] ata1.00: status: { DRDY }
[  +0.000006] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000011] ata1.00: cmd 61/80:78:80:17:c4/04:00:b8:00:00/40 tag 15 ncq dma 589824 out
                       res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
[  +0.000011] ata1.00: status: { DRDY }
[  +0.000008] ata1: hard resetting link
[  +0.375942] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.001613] ata1.00: configured for UDMA/133
[  +0.000026] ata1: EH complete
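Since smartd stayed silent here, a quick tally of "interface fatal error" events per ATA port from dmesg is a useful cross-check. A self-contained sketch (the sample is a few lines from the logs above; on the live host you would pipe in `dmesg` instead):

```shell
# Count "interface fatal error" events per ATA port from a dmesg capture.
dmesg_sample='[Jan29 22:33] ata1.00: exception Emask 0x50 SAct 0xe000 SErr 0x480900 action 0x6 frozen
[  +0.000014] ata1.00: irq_stat 0x08000000, interface fatal error
[Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x480900 action 0x6 frozen
[  +0.004173] ata2.00: irq_stat 0x08000000, interface fatal error'

# keep only the fatal-error lines, pull out the port name, and tally
fatal=$(printf '%s\n' "$dmesg_sample" \
    | grep 'interface fatal error' \
    | grep -o 'ata[0-9]*\.00' \
    | sort | uniq -c)
echo "$fatal"
```

Both ata1 (sda's port) and ata2 (sdb's port) showing up in that tally is what points at a controller or cabling problem rather than a single bad drive.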

comment:2 Changed 6 months ago by anarcat

and we got more SMART email messages about this server. now it's also sdb that's complaining.

i also noticed that sdb had been complaining even before i opened that ticket with Hetzner. in fact, what triggered me to open that ticket was the second smartd email, which i mistakenly thought was about sda errors: that second email was actually about sdb! so changing sda wouldn't have solved that problem.

i commented on hetzner's ticket with the following:

We're still having trouble with this server.

After a full RAID-1 resync, I rebooted the box, but the new disk was
kicked out of the array, and not detected as having a RAID superblock:

root@fsn-node-03:~# mdadm -E /dev/sda1
mdadm: No md superblock detected on /dev/sda1.

When I started the array and re-added the disk, it started a full resync
again:

root@fsn-node-03:~# mdadm --run /dev/md2
mdadm: started array /dev/md/2
root@fsn-node-03:~# mdadm /dev/md2 -a /dev/sda1
mdadm: added /dev/sda1
root@fsn-node-03:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sda1[2] sdb1[1]
      9766302720 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (274048/9766302720) finish=593.9min speed=274048K/sec
      bitmap: 0/73 pages [0KB], 65536KB chunk

md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      937026560 blocks super 1.2 [2/2] [UU]
      bitmap: 1/7 pages [4KB], 65536KB chunk

md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
      523264 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

Furthermore, I just noticed that we have received smartd notifications about the *OTHER*
hard drive (sdb):

Date: Wed, 29 Jan 2020 23:46:38 +0000

[...]

Device: /dev/sdb [SAT], ATA error count increased from 4 to 5

Device info:
TOSHIBA MG06ACA10TEY, S/N:[...], WWN:[...], FW:0103, 10.0 TB

We have also seen errors from sdb, the second drive, *before* we opened
this ticket. That was my mistake: I thought the errors were both from
the same disk; I couldn't imagine both disks were giving out errors.

At this point, I am wondering if it might not be better to just
commission a completely new machine than to keep trying to revive this
one. I get the strong sense something is wrong with the disk controller
on this one. We have two other PX62 servers with an identical setup
(fsn-node-01/PX62-NVMe #[...], fsn-node-02/PX62-NVMe #[...]).
Both are in production and neither shows the same disk problems.

In any case, I can't use the box like this: its (software) RAID array
doesn't survive reboots, which tells me there's something very wrong
with this machine.

Could you look into this again please?

So I think that, worst case, they just swap the machine and we reinstall.
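The `[_U]` marker in the mdstat output quoted above is what flags the degraded mirror (an underscore is a missing member). A minimal sketch of a check for that, using a condensed copy of this ticket's mdstat as sample input (on the live host you would read /proc/mdstat directly):

```shell
# Flag degraded md arrays: a status field like [_U] or [U_] means one
# mirror half is missing. Sample data condensed from the mdstat above.
mdstat='md2 : active raid1 sda1[2] sdb1[1]
      9766302720 blocks super 1.2 [2/1] [_U]
md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      937026560 blocks super 1.2 [2/2] [UU]'

degraded=$(printf '%s\n' "$mdstat" | awk '
  /^md/        { name = $1 }                 # remember the array name
  /\[[U_]+\]/  { if ($0 ~ /_/) print name }  # underscore => degraded
')
echo "degraded: $degraded"
```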

comment:3 Changed 6 months ago by anarcat

hetzner moved the NVMe drives into a new server and is setting up new HDDs in there as well.

comment:4 Changed 6 months ago by anarcat

new HDDs installed, but the box doesn't get past cryptsetup. for some reason the initrd is giving me this on the SSH prompt:

Error: Timeout reached while waiting for askpass.

and mandos doesn't seem to be able to bring the machine back online either.

rebooting into rescue, and starting the good old dance of:

# unlock the encrypted RAID device
cryptsetup luksOpen /dev/md1 crypt_dev_md1
# activate the LVM volume groups on top of it
vgchange -a y
# mount the root filesystem and /boot
mount /dev/vg_ganeti/root /mnt
mount /dev/md0 /mnt/boot
# bind-mount the virtual filesystems so the chroot works
for dev in dev sys proc run ; do mount -o bind /$dev /mnt/$dev ; done
# set the hostname so tools inside the chroot behave
hostname fsn-node-03
chroot /mnt

comment:5 Changed 6 months ago by anarcat

aaand the problem was that an NVMe drive was missing! not sure why that kept it from booting properly, but after an update-initramfs it booted correctly. ping'd Hetzner about that again.

comment:6 Changed 6 months ago by anarcat

they rechecked the disks and everything is back online. re-added the NVMe disks to their array, which has already rebuilt, and rebuilt the md2 array on the HDDs, which will be done in ~13h.

still todo: restore vg_ganeti_hdd over LUKS.
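Restoring that layered stack (LUKS on md2, then LVM) would look roughly like this. A dry-run sketch: only the device name and the vg_ganeti_hdd VG name come from this ticket; the mapper name and the exact cryptsetup options used in production are assumptions, and the `run` wrapper only prints each command:

```shell
# Dry-run sketch of recreating the encrypted HDD volume group.
# run() only prints each command instead of executing it.
run() { echo "+ $*"; }

run cryptsetup luksFormat /dev/md2                     # encrypt the new HDD array
run cryptsetup luksOpen /dev/md2 crypt_dev_md2         # mapper name is an assumption
run pvcreate /dev/mapper/crypt_dev_md2                 # LVM physical volume on top
run vgcreate vg_ganeti_hdd /dev/mapper/crypt_dev_md2   # VG name from this ticket
```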

comment:7 Changed 6 months ago by anarcat

still todo: restore vg_ganeti_hdd over LUKS.

done. let's see how the resync goes now.

comment:8 Changed 6 months ago by anarcat

Status: assigned → merge_ready

resync seems to be going fine, i'll do one last reboot in about an hour and then we're clear here.

comment:9 Changed 5 months ago by anarcat

Resolution: fixed
Status: merge_ready → closed

box rebooted fine last week, disk arrays are all sync'd and work fine (no smartd errors); hopefully we're done with the disk problems here.
