Opened 7 months ago

Closed 7 months ago

Last modified 7 months ago

#33684 closed defect (fixed)

smartd ignores nvme devices

Reported by: anarcat
Owned by: anarcat
Priority: Immediate
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Major
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

during the fsn-node-05 setup (#33083), i noticed an error in systemd because smartd wouldn't start. the error was:

Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: Opened configuration file /etc/smartd.conf
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Mar 21 21:06:38 fsn-node-05.torproject.org systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: DEVICESCAN failed: glob(3) aborted matching pattern /dev/discs/disc*
Mar 21 21:06:38 fsn-node-05.torproject.org systemd[1]: smartd.service: Failed with result 'exit-code'.
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: In the system's table of devices NO devices found to scan
Mar 21 21:06:38 fsn-node-05.torproject.org smartd[5390]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...

yet smartd doesn't have this problem on any of the other ganeti nodes, for example on fsn-node-03:

Mar 21 21:07:29 fsn-node-03 smartd[4826]: smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Opened configuration file /etc/smartd.conf
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda [SAT], opened
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda [SAT], XXXX
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda [SAT], not found in smartd database.
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.XXXXX
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb, type changed from 'scsi' to 'sat'
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb [SAT], opened
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb [SAT], XXXX
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb [SAT], found in smartd database: HGST Ultrastar He10
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.XXXX
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.XXXXX
Mar 21 21:07:29 fsn-node-03 smartd[4826]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.XXXXXX

that's when i noticed this critical line on fsn-node-03:

Mar 21 21:07:29 fsn-node-03 smartd[4826]: Monitoring 2 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices

.. which explains why smartd is crashing on fsn-node-05: it *only* has nvme drives, so smartd is upset because it doesn't have drives to monitor.

as the status message above implies, smartd *does* have support for NVMe, so this should work out of the box. it just doesn't find the drives.
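the mismatch described above (NVMe hardware present, none monitored) could be checked per host with something like the sketch below. it is only a sketch: the function names are made up for illustration, the directory argument exists just so it can be tried on a scratch directory, and it assumes controllers show up as /dev/nvme0, /dev/nvme1 and so on.

```shell
#!/bin/sh
# hypothetical helpers, not part of smartmontools

# count NVMe controller nodes (nvme0, nvme1, ...) in a directory.
# dev_dir defaults to /dev.
count_nvme_controllers() {
    dev_dir=${1:-/dev}
    count=0
    for path in "$dev_dir"/nvme*; do
        [ -e "$path" ] || continue      # an unmatched glob stays literal; skip it
        name=${path##*/}
        case ${name#nvme} in
            ''|*[!0-9]*) ;;             # skip namespaces and partitions (nvme0n1, nvme0n1p1)
            *) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# read smartd log lines on stdin and print the NVMe count from the last
# "Monitoring ..." summary line; prints nothing if there is no such line
# (for example when smartd exited before getting that far).
smartd_nvme_monitored() {
    sed -n 's/.*Monitoring .* and \([0-9][0-9]*\) NVMe devices.*/\1/p' | tail -n 1
}

# possible use on a host (left commented out, it needs journalctl and a smartd unit):
#   found=$(count_nvme_controllers)
#   monitored=$(journalctl -u smartd -b --no-pager | smartd_nvme_monitored)
#   [ "$found" -gt "${monitored:-0}" ] && echo "smartd is missing NVMe drives"
```

an empty result from the second helper is treated as zero in the commented example, which matches the fsn-node-05 case where smartd never printed a summary line.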

Change History (3)

comment:1 Changed 7 months ago by anarcat

Resolution: fixed
Status: assigned → closed

i fixed this on what i believe are the only machines with NVMe drives, with:

cumin 'C:roles::ganeti::fsn' 'apt install -t buster-backports smartmontools'

puppet has also been tweaked to get the right package version installed on first install, which should cover further problems.

i'm assuming that unattended-upgrades will just do the right thing for the rest of this.
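to confirm that a host actually runs the newer smartd after the install, the startup banner in the logs carries the version. a minimal sketch, assuming the banner keeps the "smartd VERSION YYYY-MM-DD ..." layout seen in the logs above; the function name is hypothetical:

```shell
#!/bin/sh
# hypothetical helper: read smartd log lines on stdin and print the version
# from the last startup banner, e.g. "smartd 6.6 2017-11-05 r4594 [...]" -> "6.6".
# prints nothing if no banner line is present.
smartd_banner_version() {
    sed -n 's/.*smartd \([0-9][0-9.]*\) [0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\} .*/\1/p' | tail -n 1
}

# possible use on a host (left commented out, it needs journalctl and a smartd unit):
#   journalctl -u smartd -b --no-pager | smartd_banner_version
```

the installed package version can also be read directly with `dpkg-query -W -f='${Version}\n' smartmontools`, which does not depend on smartd having been restarted.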

comment:2 Changed 7 months ago by anarcat

Component: - Select a component → Internal Services/Tor Sysadmin Team
Priority: Medium → Immediate
Severity: Normal → Major

unattended-upgrades thing mentioned in #31957

comment:3 Changed 7 months ago by anarcat

turns out other nodes have nvme drives, so i did:

cumin-all 'ls /dev/nvme* && apt install -t buster-backports smartmontools'

... but of course this failed on macrum and kvm4... considering those nodes will migrate "shortly" (hopefully!), i think it's fine to leave them in the dark, especially since there's no backported version for them...
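a guarded variant of the command above could skip hosts where the backports suite doesn't apply instead of failing on them. this is a sketch only: it assumes the failures came from those hosts not being on buster, and that the release codename can be read from VERSION_CODENAME in /etc/os-release. both function names are hypothetical.

```shell
#!/bin/sh
# hypothetical helpers for a guarded install

# succeeds only when the os-release file names the wanted codename.
# os_release defaults to /etc/os-release; a missing file or a missing
# VERSION_CODENAME field counts as "no match", so the install is skipped.
release_is() {
    want=$1
    os_release=${2:-/etc/os-release}
    [ -r "$os_release" ] || return 1
    codename=$(sed -n 's/^VERSION_CODENAME=//p' "$os_release" | tr -d '"')
    [ "$codename" = "$want" ]
}

# succeeds when at least one nvme* entry exists in dev_dir (default /dev),
# without the error output that "ls /dev/nvme*" prints when nothing matches.
has_nvme() {
    dev_dir=${1:-/dev}
    for path in "$dev_dir"/nvme*; do
        [ -e "$path" ] && return 0
    done
    return 1
}

# possible use (left commented out, it changes installed packages):
#   release_is buster && has_nvme && apt install -t buster-backports smartmontools
```

hosts that fail either check are left untouched, which is the same outcome as leaving macrum and kvm4 alone by hand.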
