  
==== Failing Components ====
After initial post-delivery replacements, hard drives fail the most often (as expected). As of writing (Aug 2017), we have been through 3 replacement calls (only a few drives each time) + 1 upcoming. Other defective components can be reported via the support site or by e-mail. Please also provide some "proof" that it is defective, or discuss with them how to do that (possibly after reading their service support Tips/Troubleshooting/... pages).
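
A quick way to gather such proof is the S.M.A.R.T. health summary and error log of the suspect drive (a minimal sketch; ''/dev/sdb'' is just an example device name):
  $ sudo smartctl -H /dev/sdb          # overall health self-assessment: PASSED or FAILED
  $ sudo smartctl -l error /dev/sdb    # recent ATA error log entries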
  
For up to 20 drives, our vendor wants defective //hard drives// to be reported via their [[https://service.clustervision.com/ | service site]] (instead of by e-mail).
For the service calls so far, we did not need to spend service credits (as long as our support contract lasts).\\
Log in, then:
  * Click on one of the support call links at the bottom of the page, e.g. for hard drives.
  * Verify the shipping address for return parts, then fill in:
  * ClusterVision machine tag: 150037 (I'm entirely unsure about this, but it doesn't matter.)
  * Your machine reference: drg02 drg12 drg16
  * Hard Disk Brand: TOSHIBA
  * Hard Disk Model: MC04ACA400E
  * Hard Disk Serialnumbers: 55I7K0XUFLSA 55IAK0N0FLSA 55IAK0NEFLSA (example)
  * Number of Hard Disks with Failures: 3
  * Linux Kernel Version: ''uname -a'' command output, e.g.: Linux drg02 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Hardware Settings:
  - Has the BIOS configuration been set to AHCI?: yes
  - Is the disk part of an array on a dedicated hardware RAID controller?: no (drgXX nodes), or yes (dragnet or dragproc node)
  - Have you tried to fully re-provision the software onto the disk?: yes
Fault Description - Please use this part for all disks.:
  
  [root@drg02 ~]$ smartctl --all /dev/sdb
  smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.el7.x86_64] (local build)
  Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
  
  === START OF INFORMATION SECTION ===
  Device Model:     TOSHIBA MC04ACA400E
  [...]
  Error 647 occurred at disk power-on lifetime: 7064 hours (294 days + 8 hours)
     When the command that caused the error occurred, the device was active or idle.
  
     After command completion occurred, registers were:
     ER ST SC SN CL CH DH
     -- -- -- -- -- -- --
     40 41 98 70 ea 28 40  Error: UNC at LBA = 0x0028ea70 = 2681456
  
     Commands leading to the command that caused the error were:
     CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
     -- -- -- -- -- -- -- --  ----------------  --------------------
     60 08 98 70 ea 28 40 00  20d+14:42:31.049  READ FPDMA QUEUED
     ef 90 03 00 00 00 a0 00  20d+14:42:31.047  SET FEATURES [Disable SATA feature]
     ef 10 02 00 00 00 a0 00  20d+14:42:31.047  SET FEATURES [Enable SATA feature]
     27 00 00 00 00 00 e0 00  20d+14:42:31.047  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
     ec 00 00 00 00 00 a0 00  20d+14:42:31.046  IDENTIFY DEVICE
  
  Error 646 occurred at disk power-on lifetime: 7064 hours (294 days + 8 hours)
     When the command that caused the error occurred, the device was active or idle.
  [...]
  
  [SMART info for each drive]
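
To collect the full report for every data drive in one go, a loop like this can be used (a sketch assuming drives ''sda'' through ''sdd'' as on the drgXX nodes; the output file name is arbitrary):
  $ for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do sudo smartctl --all "$d"; done > smart_report.txt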
  
  
You may continue working, send the same command to the next drive, or even reboot; the drive continues the test whenever it has no other requests to serve. An extended test may take many hours. Afterwards, you can show the S.M.A.R.T. information again to see the result, but if a failure threshold was exceeded, our ''smartd'' service also sends an e-mail.
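
To check how a self-test ended without waiting for the ''smartd'' e-mail, you can also read the drive's self-test log directly (''/dev/sdb'' again as an example):
  $ sudo smartctl -l selftest /dev/sdb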
  
To run a superuser command for all drives on all ''drgXX'' nodes, put it in a script (''ansible -a'' only takes trivial commands), such as:
  #!/bin/sh
  hostname
  for dev in /dev/sd[abcd]; do smartctl --test=long "$dev"; done
mark it executable and run it via ansible as superuser:
  $ chmod 755 $HOME/yourscript.sh
  $ ansible workers -b -K -f 25 -a '$HOME/yourscript.sh'  # + possibly redirecting stdout and/or stderr
then type your password once.

To restart all ''smartd'' services, run:
  $ ansible alldragnet -b -K -f 25 -a 'systemctl restart smartd'
  
== Upon Drive Failure ==
//Important//: All drives have a Device Model number and a Serial Number, etc. Ensure you (Cpt. Obvious):
  * first report the right IDs + copy-paste the S.M.A.R.T. logs from the right drive
  * then replace the right drive in the data center(!) (ask Mike Sipior how to flash the right drive lights, etc.)
  * then return ship the drive(s) with the reported ID(s)!
  
Then you need to:
  * restore the broken RAID (on drgXX): ask Mike Sipior, or use the ''mdadm'' command (a display of the healthy state is shown below), then instruct the kernel to reload partition tables using the ''partprobe'' command; see the sketch after this list
  * remount the partition (/dev/mdX to /data[12])
  * copy back the backed up data
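
A rough sketch of the RAID recovery commands, assuming the failed drive is ''/dev/sdb'' in array ''/dev/md1'' mounted on ''/data1'' (verify the actual names with ''lsblk'' and ''cat /proc/mdstat'' before running anything):
  $ sudo mdadm --manage /dev/md1 --fail /dev/sdb --remove /dev/sdb   # if the kernel hasn't dropped it already
  # ... physically replace the drive ...
  $ sudo partprobe                                # have the kernel reload the partition tables
  $ sudo mdadm --manage /dev/md1 --add /dev/sdb   # starts the rebuild; progress is in /proc/mdstat
  $ sudo mount /dev/md1 /data1                    # remount once the array is usable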
==== Shipping and Return ====
If accepted, our vendor generally sends a replacement within 1 working day, provided they have replacements in stock. They also send a UPS label for easy return shipping of the defective component (to their return address in Amsterdam).

Work with Derk Kuipers or someone else at the ASTRON stockroom to send (and pick up) packages. If you have no (valid) UPS label, get a blank label from the reception.
  
  
For disk replacement, one person is enough. If the node has to be opened up, 2 persons are needed (sliding rails issue + node weight + limited space in data center).
  
The ''ledmon'' package (''sudo yum install ledmon'') provides ''ledctl'', which can be used to turn the status lights on the disk caddies on and off. Use ''ledctl locate=/dev/sdd'' or ''ledctl locate_off=/dev/sdd'' to turn the light of the device in question on or off.
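
To map a reported serial number to the right ''/dev/sdX'' device before flashing its light, something like:
  $ lsblk -o NAME,SIZE,SERIAL
works, as does ''sudo smartctl -i /dev/sdX''.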
  
==== Overview Examples of mdadm and lsblk ====
How it should look on drgXX nodes (example on drg22 from Aug 2017):
  
  [amesfoort@drg22 ~]$ sudo mdadm --detail /dev/md1 /dev/md2