Opened 6 months ago

Closed 5 months ago

#33098 closed defect (fixed)

fsn-node-03 disk problems

Reported by: anarcat
Owned by: anarcat
Priority: High
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Blocker
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

For some reason, the HDD on fsn-node-03 is reporting SMART errors. I originally filed this ticket with Hetzner:

yesterday, we got errors from the SMART daemon on this host, looking like this:

From: root <root@…>
Subject: SMART error (ErrorCount) detected on host: fsn-node-03
To: root@…
Date: Tue, 28 Jan 2020 23:35:35 +0000

This message was generated by the smartd daemon running on:

host name: fsn-node-03
DNS domain: torproject.org

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], ATA error count increased from 0 to 1

Device info:
TOSHIBA MG06ACA10TEY, S/N:..., WWN:...., FW:0103, 10.0 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

Another such email triggered an hour later as well.
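smartd only tells us that the error count went up; pulling the actual numbers out of the alert body is a one-liner. A small sketch using the message text quoted above (the `sed` extraction is a hypothetical helper, not part of our monitoring):

```shell
# Extract the old and new ATA error counts from a smartd alert line
# like the one quoted above.
line='Device: /dev/sda [SAT], ATA error count increased from 0 to 1'
old=$(printf '%s\n' "$line" | sed -n 's/.*increased from \([0-9]*\) to [0-9]*.*/\1/p')
new=$(printf '%s\n' "$line" | sed -n 's/.*increased from [0-9]* to \([0-9]*\).*/\1/p')
echo "error count went from $old to $new"
```

On the host itself, `smartctl -l error /dev/sda` shows the full error log behind those counts.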

The RAID array the disk is on triggered a rebuild as well, somehow. The following
messages showed up in dmesg:

[Jan28 20:44] md: resync of RAID array md2
[Jan28 22:20] ata2.00: exception Emask 0x50 SAct 0x4000 SErr 0x480900 action 0x6 frozen
[  +0.004419] ata2.00: irq_stat 0x08000000, interface fatal error
[  +0.001489] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
[  +0.000781] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000785] ata2.00: cmd 61/00:70:80:52:f6/05:00:ec:00:00/40 tag 14 ncq dma 655360 out
                       res 40/00:70:80:52:f6/00:00:ec:00:00/40 Emask 0x50 (ATA bus error)
[  +0.001600] ata2.00: status: { DRDY }
[  +0.000801] ata2: hard resetting link
[  +0.310126] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.088155] ata2.00: configured for UDMA/133
[  +0.000031] ata2: EH complete
[Jan28 23:27] ata1.00: exception Emask 0x50 SAct 0x1c00 SErr 0x280900 action 0x6 frozen
[  +0.004338] ata1.00: irq_stat 0x08000000, interface fatal error
[  +0.001815] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
[  +0.000772] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000738] ata1.00: cmd 60/00:50:00:3b:b1/05:00:47:01:00/40 tag 10 ncq dma 655360 in
                       res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001512] ata1.00: status: { DRDY }
[  +0.000793] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000727] ata1.00: cmd 60/00:58:00:40:b1/05:00:47:01:00/40 tag 11 ncq dma 655360 in
                       res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001534] ata1.00: status: { DRDY }
[  +0.000769] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000720] ata1.00: cmd 60/00:60:00:45:b1/01:00:47:01:00/40 tag 12 ncq dma 131072 in
                       res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001453] ata1.00: status: { DRDY }
[  +0.000778] ata1: hard resetting link
[  +0.556198] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.001780] ata1.00: configured for UDMA/133
[  +0.000037] ata1: EH complete
[Jan28 23:32] perf: interrupt took too long (2518 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x480900 action 0x6 frozen
[  +0.004173] ata2.00: irq_stat 0x08000000, interface fatal error
[  +0.001996] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
[  +0.000737] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000729] ata2.00: cmd 61/00:d0:00:62:0e/05:00:86:01:00/40 tag 26 ncq dma 655360 out
                       res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001486] ata2.00: status: { DRDY }
[  +0.000854] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000718] ata2.00: cmd 61/00:d8:00:67:0e/05:00:86:01:00/40 tag 27 ncq dma 655360 out
                       res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001478] ata2.00: status: { DRDY }
[  +0.000884] ata2.00: failed command: WRITE FPDMA QUEUED
[  +0.000736] ata2.00: cmd 61/00:e0:00:6c:0e/01:00:86:01:00/40 tag 28 ncq dma 131072 out
                       res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001453] ata2.00: status: { DRDY }
[  +0.000760] ata2: hard resetting link
[  +0.000011] ata1.00: exception Emask 0x50 SAct 0x10000000 SErr 0x280900 action 0x6 frozen
[  +0.000764] ata1.00: irq_stat 0x08000000, interface fatal error
[  +0.000725] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
[  +0.000712] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000700] ata1.00: cmd 60/80:e0:00:6d:0e/04:00:86:01:00/40 tag 28 ncq dma 589824 in
                       res 40/00:e0:00:6d:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
[  +0.001426] ata1.0...

I lost the original message, as Hetzner trims replies; it also included the drive's smartctl -x output, which is therefore lost as well.

40 minutes later, the drive was replaced and the machine booted again.

We had trouble with the /dev/md2 array: for some reason it wouldn't autostart after the intervention. I started it by hand, rebuilt the initrd and rebooted, to no avail.

I tried to repartition the new sda drive they added, then added it to the array, which started syncing.
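That repartition-and-re-add dance looks roughly like this. A dry-run sketch: the `run` wrapper only prints each command, so it is safe anywhere; on the real host you would execute the commands directly (the piped `sfdisk` idiom assumes GPT/MBR cloning is appropriate for the layout):

```shell
# Dry-run sketch of cloning the surviving disk's partition table onto
# the replacement drive and re-adding it to the degraded mirror.
# run() only prints each command instead of executing it.
run() { echo "+ $*"; }

# copy sdb's partition table onto the new sda
# (for real: sfdisk -d /dev/sdb | sfdisk /dev/sda)
run 'sfdisk -d /dev/sdb | sfdisk /dev/sda'
# re-add the new partition to the array; this kicks off a resync
run mdadm /dev/md2 --add /dev/sda1
# watch the rebuild progress
run cat /proc/mdstat
```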

But after a while, the error came back:

[Jan29 18:30] ata1.00: exception Emask 0x50 SAct 0x80080 SErr 0x480900 action 0x6 frozen
 [  +0.000020] ata1.00: irq_stat 0x08000000, interface fatal error
 [  +0.000010] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
 [  +0.000012] ata1.00: failed command: READ FPDMA QUEUED
 [  +0.000018] ata1.00: cmd 60/20:38:00:98:04/00:00:00:00:00/40 tag 7 ncq dma 16384 in
                        res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
 [  +0.000021] ata1.00: status: { DRDY }
 [  +0.000010] ata1.00: failed command: WRITE FPDMA QUEUED
 [  +0.000015] ata1.00: cmd 61/00:98:00:e2:ff/05:00:0e:01:00/40 tag 19 ncq dma 655360 out
                        res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
 [  +0.000012] ata1.00: status: { DRDY }
 [  +0.000009] ata1: hard resetting link
 [  +0.311884] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 [  +0.049673] ata1.00: configured for UDMA/133
 [  +0.000023] ata1: EH complete

and smartd sent us another email about:

Device: /dev/sda [SAT], ATA error count increased from 0 to 1

i reopened the ticket with Hetzner, who will do another visit to the server shortly. they also found it strange that the error came back, and suspect something might be wrong with the SATA cables.

Child Tickets

Change History (9)

comment:1 Changed 6 months ago by anarcat

they tested the drive (probably a SMART short self-test) and didn't find anything. they swapped the cable and gave us back the box.

after boot, the raid array *again* did not come up. i restarted it and re-added the extra drive. during the sync, we got an ATA error again, but this time it seems the drive didn't log it, because smartd didn't send us an email. i only noticed it through a dmesg tail:

[Jan29 20:00] microcode: microcode updated early to revision 0xca, date = 2019-10-03
[  +0.000000] Linux version 4.19.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
[...]
[Jan29 20:18] md/raid1:md2: active with 1 out of 2 mirrors
[  +0.050102] md2: detected capacity change from 0 to 10000693985280
[Jan29 20:19] md: recovery of RAID array md2
[...]
[Jan29 22:33] ata1.00: exception Emask 0x50 SAct 0xe000 SErr 0x480900 action 0x6 frozen
[  +0.000014] ata1.00: irq_stat 0x08000000, interface fatal error
[  +0.000007] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
[  +0.000009] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000013] ata1.00: cmd 61/00:68:80:0c:c4/08:00:b8:00:00/40 tag 13 ncq dma 1048576 out
                       res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
[  +0.000011] ata1.00: status: { DRDY }
[  +0.000005] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000012] ata1.00: cmd 61/00:70:80:14:c4/03:00:b8:00:00/40 tag 14 ncq dma 393216 out
                       res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
[  +0.000011] ata1.00: status: { DRDY }
[  +0.000006] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000011] ata1.00: cmd 61/80:78:80:17:c4/04:00:b8:00:00/40 tag 15 ncq dma 589824 out
                       res 40/00:68:80:0c:c4/00:00:b8:00:00/40 Emask 0x50 (ATA bus error)
[  +0.000011] ata1.00: status: { DRDY }
[  +0.000008] ata1: hard resetting link
[  +0.375942] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.001613] ata1.00: configured for UDMA/133
[  +0.000026] ata1: EH complete
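Since smartd stayed silent here, a quick tally of "interface fatal error" events per ATA port from dmesg is a useful cross-check. A self-contained sketch (the sample is a few lines from the logs above; on the live host you would pipe in `dmesg` instead):

```shell
# Count "interface fatal error" events per ATA port from a dmesg capture.
dmesg_sample='[Jan29 22:33] ata1.00: exception Emask 0x50 SAct 0xe000 SErr 0x480900 action 0x6 frozen
[  +0.000014] ata1.00: irq_stat 0x08000000, interface fatal error
[Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x480900 action 0x6 frozen
[  +0.004173] ata2.00: irq_stat 0x08000000, interface fatal error'

# keep only the fatal-error lines, pull out the port name, and tally
fatal=$(printf '%s\n' "$dmesg_sample" \
    | grep 'interface fatal error' \
    | grep -o 'ata[0-9]*\.00' \
    | sort | uniq -c)
echo "$fatal"
```

Both ata1 (sda's port) and ata2 (sdb's port) showing up in that tally is what points at a controller or cabling problem rather than a single bad drive.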

comment:2 Changed 6 months ago by anarcat

and we got more SMART email messages about this server. now it's also sdb that's complaining.

i also noticed that sdb had been complaining even before i opened that ticket with Hetzner. in fact, what triggered me to open that ticket was the second smartd email, which i mistakenly thought was about sda errors: that second email was actually about sdb! so changing sda wouldn't have solved that problem.

i commented on hetzner's ticket with the following:

We're still having trouble with this server.

After a full RAID-1 resync, I rebooted the box, but the new disk was
kicked out of the array, and not detected as having a RAID superblock:

root@fsn-node-03:~# mdadm -E /dev/sda1
mdadm: No md superblock detected on /dev/sda1.

When I started the array and re-added the disk, it started a full resync
again:

root@fsn-node-03:~# mdadm --run /dev/md2
mdadm: started array /dev/md/2
root@fsn-node-03:~# mdadm /dev/md2 -a /dev/sda1
mdadm: added /dev/sda1
root@fsn-node-03:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sda1[2] sdb1[1]
      9766302720 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (274048/9766302720) finish=593.9min speed=274048K/sec
      bitmap: 0/73 pages [0KB], 65536KB chunk

md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      937026560 blocks super 1.2 [2/2] [UU]
      bitmap: 1/7 pages [4KB], 65536KB chunk

md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
      523264 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

Furthermore, I just noticed that we have received smartd notifications about the *OTHER*
hard drive (sdb):

Date: Wed, 29 Jan 2020 23:46:38 +0000

[...]

Device: /dev/sdb [SAT], ATA error count increased from 4 to 5

Device info:
TOSHIBA MG06ACA10TEY, S/N:[...], WWN:[...], FW:0103, 10.0 TB

We have also seen errors from sdb, the second drive, *before* we opened
this ticket. That was my mistake: I thought the errors were both from
the same disk; I couldn't imagine both disks were giving out errors.

At this point, I am wondering if it might not be better to just
commission a completely new machine than to keep trying to revive this
one. I get the strong sense something is wrong with the disk controller
on this one. We have two other PX62 servers with an identical setup
(fsn-node-01/PX62-NVMe #[...], fsn-node-02/PX62-NVMe #[...]).
Both are in production and neither shows the same disk problems.

In any case, I can't use the box like this: its (software) RAID array
doesn't survive reboots, which tells me there's something very wrong
with this machine.

Could you look into this again please?

So I think that, worst case, they just swap the machine and we reinstall.
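The `[_U]` marker in the mdstat output quoted above is what flags the degraded mirror (an underscore is a missing member). A minimal sketch of a check for that, using a condensed copy of this ticket's mdstat as sample input (on the live host you would read /proc/mdstat directly):

```shell
# Flag degraded md arrays: a status field like [_U] or [U_] means one
# mirror half is missing. Sample data condensed from the mdstat above.
mdstat='md2 : active raid1 sda1[2] sdb1[1]
      9766302720 blocks super 1.2 [2/1] [_U]
md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      937026560 blocks super 1.2 [2/2] [UU]'

degraded=$(printf '%s\n' "$mdstat" | awk '
  /^md/        { name = $1 }                 # remember the array name
  /\[[U_]+\]/  { if ($0 ~ /_/) print name }  # underscore => degraded
')
echo "degraded: $degraded"
```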

comment:3 Changed 6 months ago by anarcat

hetzner moved the NVMe drives into a new server and is setting up new HDDs in there as well.

comment:4 Changed 6 months ago by anarcat

new HDDs installed, but the box doesn't get past cryptsetup. for some reason the initrd is giving me this on the SSH prompt:

Error: Timeout reached while waiting for askpass.

and mandos doesn't seem to be able to bring the machine back online either.

rebooting into rescue, and starting the good old dance of:

# unlock the encrypted RAID device
cryptsetup luksOpen /dev/md1 crypt_dev_md1
# activate the LVM volume groups on top of it
vgchange -a y
# mount the root filesystem and /boot
mount /dev/vg_ganeti/root /mnt
mount /dev/md0 /mnt/boot
# bind-mount the virtual filesystems so the chroot works
for dev in dev sys proc run ; do mount -o bind /$dev /mnt/$dev ; done
# set the hostname so tools inside the chroot behave
hostname fsn-node-03
chroot /mnt

comment:5 Changed 6 months ago by anarcat

aaand the problem was that an NVMe drive was missing! not sure why that kept it from booting properly, but after an update-initramfs it booted correctly. ping'd Hetzner about that again.

comment:6 Changed 6 months ago by anarcat

they rechecked the disks and everything is back online. re-added the NVMe disks to their array, which has already rebuilt, and rebuilt the md2 array on the HDDs, which will be done in ~13h.

still todo: restore vg_ganeti_hdd over LUKS.
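Restoring that layered stack (LUKS on md2, then LVM) would look roughly like this. A dry-run sketch: only the device name and the vg_ganeti_hdd VG name come from this ticket; the mapper name and the exact cryptsetup options used in production are assumptions, and the `run` wrapper only prints each command:

```shell
# Dry-run sketch of recreating the encrypted HDD volume group.
# run() only prints each command instead of executing it.
run() { echo "+ $*"; }

run cryptsetup luksFormat /dev/md2                     # encrypt the new HDD array
run cryptsetup luksOpen /dev/md2 crypt_dev_md2         # mapper name is an assumption
run pvcreate /dev/mapper/crypt_dev_md2                 # LVM physical volume on top
run vgcreate vg_ganeti_hdd /dev/mapper/crypt_dev_md2   # VG name from this ticket
```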

comment:7 Changed 6 months ago by anarcat

still todo: restore vg_ganeti_hdd over LUKS.

done. let's see how the resync goes now.

comment:8 Changed 6 months ago by anarcat

Status: assigned → merge_ready

resync seems to be going fine, i'll do one last reboot in about an hour and then we're clear here.

comment:9 Changed 5 months ago by anarcat

Resolution: fixed
Status: merge_ready → closed

box rebooted fine last week, disk arrays are all sync'd and work fine (no smartd errors); hopefully we're done with the disk problems here.
