dragnet:cluster_support

dragnet:cluster_support [2017-08-18 08:26] – add vendor service site specifics amesfoort
dragnet:cluster_support [2017-08-29 11:09] (current) – [Failing Components] add smartd restart ansible command amesfoort
Line 25: Line 25:
 Hardware Settings:
   - Has the BIOS configuration been set to AHCI ?: yes
-  - Is the disk part of an array on a dedicated hardware RAID controller?: no
+  - Is the disk part of an array on a dedicated hardware RAID controller?: no (drgXX nodes), or yes (dragnet or dragproc node)
   - Have you tried to fully re-provision the software onto the disk?: yes
 Fault Description - Please use this part for all disks.:
Line 80: Line 80:
 You may continue working, send the same command to the next drive, or even reboot. The drive resumes the test whenever it has no other requests to serve. An extended test may take many hours. Afterwards, show the S.M.A.R.T. information again to see the result; if a failure threshold was exceeded, our ''smartd'' service also sends an e-mail.
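As a rough sketch of checking the results afterwards, the self-test log of each drive can be printed with ''smartctl -l selftest''. The drive list ''/dev/sda'' to ''/dev/sdd'' is an assumption (adjust ''DRIVES'' to the node), and the function name and the ''DRIVES'' and ''DRY_RUN'' variables are made up for illustration.

```shell
#!/bin/sh
# Print the S.M.A.R.T. self-test log of each drive on this node.
# DRIVES and DRY_RUN are illustrative knobs, not existing site settings.
DRIVES="${DRIVES:-/dev/sda /dev/sdb /dev/sdc /dev/sdd}"

# Without smartctl installed, only print the commands that would be run.
command -v smartctl >/dev/null 2>&1 || DRY_RUN=1

show_selftest_logs() {
    for drive in $DRIVES; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "smartctl -l selftest $drive"
        else
            echo "== $drive =="
            smartctl -l selftest "$drive"
        fi
    done
}

show_selftest_logs
```

Run it as superuser on the node; a test that is still in progress shows up in the log as not yet completed.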
  
 +To run a sudo command for all drives on all ''drgXX'' nodes, put it in a script (''ansible -a'' runs its argument without a shell by default, so it only takes simple commands), such as:
 +  #!/bin/sh
 +  hostname; smartctl --test=long /dev/sda; smartctl --test=long /dev/sdb; smartctl --test=long /dev/sdc; smartctl --test=long /dev/sdd
 +mark it executable and run it via ansible as superuser:
 +  $ chmod 755 $HOME/yourscript.sh
 +  $ ansible workers -b -K -f 25 -a '$HOME/yourscript.sh'  # + possibly redirecting stdout and/or stderr
 +then type your password once.
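A loop-based variant of such a script might look like the sketch below. It assumes the same four drives as the one-liner; ''DRIVES'' and ''DRY_RUN'' are hypothetical variables added for illustration, and the fallback that only prints the commands when ''smartctl'' is missing is a convenience of this sketch, not site policy.

```shell
#!/bin/sh
# Start a long S.M.A.R.T. self-test on each drive of this node.
# DRIVES and DRY_RUN are hypothetical knobs for this sketch.
DRIVES="${DRIVES:-/dev/sda /dev/sdb /dev/sdc /dev/sdd}"

# Without smartctl installed, only print the commands that would be run.
command -v smartctl >/dev/null 2>&1 || DRY_RUN=1

start_long_tests() {
    hostname   # shows which node the following output belongs to
    for drive in $DRIVES; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "smartctl --test=long $drive"
        else
            smartctl --test=long "$drive"
        fi
    done
}

start_long_tests
```

Keeping the drive list in one variable makes it easier to adapt the script to nodes with a different number of drives.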
 +
 +To restart all smartd services, run:
 +  $ ansible alldragnet -b -K -f 25 -a 'systemctl restart smartd'
  
 == Upon Drive Failure ==
Line 114: Line 124:
 For disk replacement, one person is enough. If the node has to be opened up, two people are needed (sliding-rails issue, node weight, and limited space in the data center).
  
 +The ''ledmon'' package (''sudo yum install ledmon'') provides ''ledctl'', which can turn the status lights on the disk caddies on and off. Use ''ledctl locate=/dev/sdd'' to turn on the locate light of a device, e.g. the drive you want to replace, and ''ledctl locate_off=/dev/sdd'' to turn it off again.
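The two ''ledctl'' calls can be wrapped in a small helper, sketched below. ''locate_drive'' and ''DRY_RUN'' are illustrative names and not part of ''ledmon''; when ''ledctl'' is not installed, the sketch only prints the command it would run.

```shell
#!/bin/sh
# Switch the locate light of one drive caddy on or off via ledctl.
# locate_drive and DRY_RUN are illustrative names, not part of ledmon.
command -v ledctl >/dev/null 2>&1 || DRY_RUN=1

locate_drive() {
    state="$1"   # "on" or "off"
    drive="$2"   # e.g. /dev/sdd
    case "$state" in
        on)  pattern="locate" ;;
        off) pattern="locate_off" ;;
        *)   echo "usage: locate_drive on|off /dev/sdX" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        echo "ledctl $pattern=$drive"
    else
        ledctl "$pattern=$drive"
    fi
}
```

For example, as superuser: ''locate_drive on /dev/sdd'' before walking to the rack, and ''locate_drive off /dev/sdd'' once the drive has been swapped.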
  
 ==== Overview Examples of mdadm and lsblk ====
  • Last modified: 2017-08-18 08:26
  • by amesfoort