Opened 7 months ago

Closed 7 months ago

Last modified 7 months ago

#29817 closed defect (fixed)

dead disk on moly

Reported by: anarcat
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Tor Sysadmin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

one of the hard drives on moly has died. this was spotted by cymru's staff and confirmed when smartd was installed (#29709).

i have done some research on the machine to figure out what's up, and wrote the following reply to Cymru's people:

[...] I can confirm that one of the hard drives in Moly has failed, according to SMART metrics we have available.

According to smartd, that disk is:

[SEAGATE ST3600057SS 0008], lu id: 0x5000c5003b5bc36f, S/N: 6SL1G7Q60000N1497K0E, 600 GB

It's a 600GB SAS drive. It's part of a megaraid RAID-10 array that has marked the drive as "Firmware state: Failed". I'll go under the assumption this means the drive is dead.

Being new here, I'm not familiar with the machine either. From what I can tell, it's a Supermicro X8DTU motherboard, and possibly an iXsystems iX1204-R700UB case. Does it look like this picture?

https://static.ixsystems.co/uploads/2017/08/1204h-t_front.png

If so, the only datasheet I could find is this limited PDF:

https://www.ixsystems.com/wp-content/uploads/2017/09/Server_Line_2017_WEB.pdf

It *does* say the hard drives are hot-swappable, so in theory, it should just be a matter of replacing the hard drive.

It looks like each drive has its own LED, so hopefully the one with the amber warning light is the dead disk. I've issued a command to the RAID controller to make it "flash" the drive LED, which should help you locate it.

I *think* the disk controller is new enough for you to simply hot swap the drive with a new one without any other intervention on our part. But it might be better if we are available during the operation. [...]

I've created some documentation on the hardware RAID stuff here:

https://help.torproject.org/tsa/howto/raid/

we're at the waiting step now - we'll see if Cymru can do the replacement and when. i'm still not quite certain we can just hotswap the drive, but I'm hoping we can.
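
for the record, the "Firmware state: Failed" check mentioned in the email could be scripted roughly as below. this is only a sketch: it assumes the controller tool prints one `Slot Number:` line followed by one `Firmware state:` line per drive (the layout of a MegaCli-style physical drive listing), and the sample text is made up for illustration, not copied from moly.

```python
def failed_drives(pdlist_output):
    """Return the slot numbers of drives whose firmware state is Failed.

    Assumes "Key: value" lines, with "Slot Number" appearing before
    "Firmware state" for each drive (an assumption about the tool output).
    """
    failed = []
    slot = None
    for line in pdlist_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot = int(value)
        elif key == "Firmware state" and value.startswith("Failed"):
            failed.append(slot)
    return failed


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Slot Number: 0
Firmware state: Online, Spun Up
Slot Number: 1
Firmware state: Failed
"""

print(failed_drives(SAMPLE))
```

a check like this only reports what the controller thinks; the SMART data from smartd is still the second opinion.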

Change History (5)

comment:1 Changed 7 months ago by anarcat

cymru responded that they can change the drive if we ship it. there are few options for the drive... if we stick to only Newegg as a seller, we basically find nothing at all. but allowing for newegg *resellers*, we find this cute little drive at $80 USD:

https://www.newegg.com/Product/Product.aspx?Item=N82E16822148617

It's the exact same model as the failing drive so it should work nicely. The question is whether "TEKDEALZ" is trustworthy.

Amazon also seems to have the drive:

https://www.amazon.com/Seagate-Cheetah-15000RPM-Internal-ST3600057SS/dp/B002P4J3YI

... but it's only sold as refurbished. (Although the Newegg/TEKDEALZ one might also be refurb and Amazon is just more honest about it.)

there's still the open question of whether the disk controller is new enough to support an unattended disk swap or whether we need to coordinate with cymru. but that can wait until they actually get the drive and everything...

comment:2 Changed 7 months ago by gk

Component: - Select a component → Internal Services/Tor Sysadmin Team

comment:3 Changed 7 months ago by anarcat

i asked jon to order the disk from newegg and ship it.

comment:4 Changed 7 months ago by anarcat

moly's drive will be replaced on monday, 1300-2200GMT.

comment:5 Changed 7 months ago by anarcat

Resolution: fixed
Status: assigned → closed

the disk ended up being replaced on wednesday by cymru's remote hands, but unfortunately the machine didn't come back up after reboot.

we weren't able to access the BIOS through the IPMI interface: neither through ipmi-console, which i installed on peninsulare and which should provide a SOL (Serial Over LAN) interface, nor through IPMIview, which weasel was able to run in an older jessie (or wheezy?) VM. it seems the BIOS is missing the configuration to redirect its output to the serial console, and a similar configuration is therefore likely missing from GRUB and the kernel as well.

in the end our precious remote hands were able to reboot the machine by simply hitting the keyboard. it seems the BIOS was configured to hang on boot if a new drive was inserted in the machine, another BIOS configuration which might be nice to disable, to say the least.

lessons learned:

  1. BIOS configuration should be standardized to a certain set of parameters to avoid these problems in the future (boot without interruption, console redirection, etc.)
  2. GRUB also needs a similar configuration
  3. we should test serial consoles before rebooting. in this case, the drive could probably have been swapped without a reboot at all, had we been worried about the reboot
  4. this machine could possibly be retired as it is approaching its 8th anniversary (see #29974 for followup)
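
lessons 2 and 3 could be partly automated with a pre-reboot check along these lines. this is a sketch under assumptions: the ticket does not say which settings moly needs, so the script only looks for the standard `GRUB_TERMINAL`, `GRUB_SERIAL_COMMAND` and `GRUB_CMDLINE_LINUX` variables in the text of a file like `/etc/default/grub`, and the serial unit and speed in the sample are placeholders. it cannot see the BIOS redirection setting at all.

```python
def parse_grub_defaults(text):
    """Parse simple KEY=value lines, ignoring comments and blank lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes; no shell expansion is attempted.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def missing_serial_settings(text):
    """Return a list of serial console settings that appear to be missing."""
    s = parse_grub_defaults(text)
    problems = []
    if "serial" not in s.get("GRUB_TERMINAL", "").split():
        problems.append("GRUB_TERMINAL does not include serial")
    if not s.get("GRUB_SERIAL_COMMAND", "").startswith("serial"):
        problems.append("GRUB_SERIAL_COMMAND is not set")
    cmdline = " ".join(
        s.get(k, "") for k in ("GRUB_CMDLINE_LINUX", "GRUB_CMDLINE_LINUX_DEFAULT")
    )
    if "console=ttyS" not in cmdline:
        problems.append("no console=ttyS* kernel parameter")
    return problems


# Placeholder configuration: the unit number and speed are assumptions.
GOOD = """\
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
"""

print(missing_serial_settings(GOOD))
print(missing_serial_settings("GRUB_TIMEOUT=5\n"))
```

an empty list only means the settings are present, not that the console works: actually connecting over SOL before the reboot is still the real test.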
Last edited 7 months ago by anarcat