Differences
This shows you the differences between two versions of the page.
dragnet:cluster_support [2017-08-17 23:39] – [Failing Components] minor typo and readability fixes amesfoort | dragnet:cluster_support [2017-08-29 11:09] (current) – [Failing Components] add smartd restart ansible command amesfoort | ||
---|---|---|---|
Line 8: | Line 8: | ||
==== Failing Components ==== | ==== Failing Components ==== | ||
- | After initial post-delivery replacements, | + | After initial post-delivery replacements, |
- | Our vendor wants defective hard drives to be reported via their support | + | For up to 20 drives, our vendor wants defective |
+ | (instead of by e-mail). | ||
+ | For the service we did so far, we don't need service credits | ||
+ | Log in, then: | ||
+ | * Click on one of the support call links at the bottom of the page. E.g. for hard drives: | ||
+ | * Verify shipping address for return parts, then fill in: | ||
+ | * ClusterVision machine tag: 150037 | ||
+ | * Your machine reference: drg02 drg12 drg16 | ||
+ | * Hard Disk Brand: TOSHIBA | ||
+ | * Hard Disk Model: MC04ACA400E | ||
+ | * Hard Disk Serialnumbers: | ||
+ | * Number of Hard Disks with Failures: 3 | ||
+ | * Linux Kernel Version: '' | ||
+ | Hardware Settings: | ||
+ | | ||
+ | - Is the disk part of an array on a dedicated hardware RAID controller?: | ||
+ | - Have you tried to fully re-provision the software onto the disk?: yes | ||
+ | Fault Description - Please use this part for all disks.: | ||
- | Other defective components can be reported via the support site or by e-mail. Please also provide some " | + | [root@drg02 ~]$ smartctl |
+ | smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.el7.x86_64] (local build) | ||
+ | Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org | ||
+ | |||
+ | === START OF INFORMATION SECTION === | ||
+ | Device Model: | ||
+ | [...] | ||
+ | Error 647 occurred at disk power-on lifetime: 7064 hours (294 days + 8 hours) | ||
+ | When the command | ||
+ | |||
+ | After command completion occurred, registers were: | ||
+ | ER ST SC SN CL CH DH | ||
+ | -- -- -- -- -- -- -- | ||
+ | 40 41 98 70 ea 28 40 Error: UNC at LBA = 0x0028ea70 = 2681456 | ||
+ | |||
+ | | ||
+ | CR FR SC SN CL CH DH DC | ||
+ | -- -- -- -- -- -- -- -- ---------------- | ||
+ | 60 08 98 70 ea 28 40 00 20d+14: | ||
+ | ef 90 03 00 00 00 a0 00 20d+14: | ||
+ | ef 10 02 00 00 00 a0 00 20d+14: | ||
+ | 27 00 00 00 00 00 e0 00 20d+14: | ||
+ | ec 00 00 00 00 00 a0 00 20d+14: | ||
+ | |||
+ | Error 646 occurred at disk power-on lifetime: 7064 hours (294 days + 8 hours) | ||
+ | When the command that caused the error occurred, the device was active or idle. | ||
+ | [...] | ||
+ | |||
+ | [SMART info for each drive] | ||
Line 35: | Line 80: | ||
You may continue working, send the same command to the next drive, or even reboot. The drive continues the test whenever it has no other requests to serve. An extended test may take many hours. Afterwards, you can show the S.M.A.R.T. information again to see the result, but if a failure threshold was exceeded, our '' | You may continue working, send the same command to the next drive, or even reboot. The drive continues the test whenever it has no other requests to serve. An extended test may take many hours. Afterwards, you can show the S.M.A.R.T. information again to see the result, but if a failure threshold was exceeded, our '' | ||
+ | To run a sudo command for all drives on all '' | ||
+ | #!/bin/sh | ||
+ | hostname; smartctl --test=long /dev/sda; smartctl --test=long /dev/sdb; smartctl --test=long /dev/sdc; smartctl --test=long /dev/sdd | ||
+ | mark it executable and run it via ansible as superuser: | ||
+ | $ chmod 755 $HOME/ | ||
+ | $ ansible workers -b -K -f 25 -a ' | ||
+ | then type your password once. | ||
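The one-line script above repeats the same command for each drive. As a minimal sketch (not part of the original page), the same commands can be generated with a loop, which makes it easier to adapt to nodes with a different set of drives. The function name `long_test_commands` is hypothetical, and the drive list is taken from the script above. The sketch only prints the commands, so it is safe to run on any machine.

```shell
#!/bin/sh
# Sketch: print one "smartctl --test=long" command per drive.
# long_test_commands is a hypothetical helper name, not an existing tool.
long_test_commands() {
    for dev in "$@"; do
        printf 'smartctl --test=long %s\n' "$dev"
    done
}

# Drive list as used in the script above; adjust per node if needed.
long_test_commands /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

After reviewing the printed commands, they could be executed on a node by piping the output to `sh` as superuser.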
+ | |||
+ | To restart all smartd services, run: | ||
+ | $ ansible alldragnet -b -K -f 25 -a ' | ||
== Upon Drive Failure == | == Upon Drive Failure == | ||
Line 60: | Line 115: | ||
==== Shipping and Return ==== | ==== Shipping and Return ==== | ||
If accepted, our vendor generally sends a replacement within 1 working day, provided they have replacements in stock. They also send a UPS label for easy return shipping of the defective component (to their return address in Amsterdam). | If accepted, our vendor generally sends a replacement within 1 working day, provided they have replacements in stock. They also send a UPS label for easy return shipping of the defective component (to their return address in Amsterdam). | ||
+ | |||
+ | Work with Derk Kuipers or someone else at the ASTRON stockroom to send (and pick up) packages. If you have no (valid) UPS label, get a blank label from the reception. | ||
Line 67: | Line 124: | ||
For disk replacement, | For disk replacement, | ||
+ | The '' | ||
==== Overview Examples of mdadm and lsblk ==== | ==== Overview Examples of mdadm and lsblk ==== | ||
- | How it should look on drgXXnodes, example on drg22 from Aug 2017): | + | How it should look on drgXX nodes (example on drg22 from Aug 2017): |
[amesfoort@drg22 ~]$ sudo mdadm --detail /dev/md1 /dev/md2 | [amesfoort@drg22 ~]$ sudo mdadm --detail /dev/md1 /dev/md2 |