
Stop-day activities December 4-5, 2018


Coordinator: Teun Grit (roadmin@astron.nl)
Software Support: Arno Schoenmakers (softwaresupport@astron.nl)
Science, Operations and Support: Matthijs, Pietro (sos@astron.nl)
Observer: Henk Mulder (observer@astron.nl)

Description of stopday procedures
LOFAR Schedule cycle 11
Stop-day progress sheet (2 tabs!)

  • √ Reboots and iDRAC reboots (Hopko)
  • √ Block access at 08:00 (Teun)
  • √ Repair memory bank of lof021; see ticket CIT-25
  • √ Debug lost Slurm reservations (Reinoud starts reboots, Robin waits for Reinoud)
  • √ Switch IPoIB to connected mode on head and gpu nodes, see https://support.astron.nl/jira/browse/ROADMT-186
  • √ The Lustre disk performance tests are planned for after the stop-day. (Volume is now at 75%.)
  • mgmt01.cep4 to CentOS 7.5
  • Robinhood tests
  • √ Disable Supervisor on both lexars at 07:45 (Teun)
  • √ Power drain of iDRACs (Hopko/Robin)
  • √ Update iDRAC firmware (Hopko/Robin)
  • √ The lexars stay on CentOS 7.2
  • √ Some investigation is needed on the iDRACs (Hopko)
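
The IPoIB item above can be verified after the switch by reading the mode back from each node. A minimal sketch in Python, assuming the standard Linux sysfs file /sys/class/net/<interface>/mode, which reports either "datagram" or "connected". The interface name ib0 and the helper names are illustrative assumptions and do not come from ROADMT-186.

```python
from pathlib import Path


def ipoib_mode(interface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the IPoIB mode ('connected' or 'datagram') of an interface.

    Reads <sysfs_root>/<interface>/mode and raises FileNotFoundError if the
    interface does not exist or is not an IPoIB device.
    """
    return (Path(sysfs_root) / interface / "mode").read_text().strip()


def is_connected_mode(interface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """True when the interface runs IPoIB in connected mode."""
    return ipoib_mode(interface, sysfs_root) == "connected"
```

On a head or gpu node, is_connected_mode("ib0") would be expected to return True once the change is in place, provided the interface is indeed called ib0 there.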

Beware of the famous ssh tunnel!

  • √ Test reboot script on 1 LCU
  • √ Reboot cn001 (not announced)
  • √ remote stations need reboot
  • Install WinCC 3.16 on 1 LCU. Jasmin: This can't be done on CentOS 7.2; we need a 7.5 system. There is a spare LCU available in the Dwingeloo digital lab (RS511). Eventually we need WinCC 3.16 on all LCUs.
  • √ Update & reboot
  • √ Check High Availability of portal2
  • OS upgrade and reboot (the SLES11_SP4 update contains ~65 packages, including a new kernel)
  • √ Stop and disable supervisor at 07:45 (Teun)
  • OS upgrade and reboot
  • √ Remove Zabbix-agent version 2.2 from scu001
  • √ Start Postgres replication ldb003 → lcs119 and database split (Reinoud)
  • √ Recabling network interfaces and p2p ldb003 / lcs119 (Arjen, please inform Reinoud)
  • ✘ Update & reboot
  • √ Connect IB switch to CEP4 spine switches, instead of the Cobalt switch. The cables are already in. (Hopko)
  • √ Update & reboot ais001-007
  • √ Update & reboot ads001
  • Update & reboot aartfaac-lcu: No! The openSUSE 13.1 system is too far behind.
  • √ Warm reset 11:00h (Arjen)
  • none
  • none
  • None
  • Slurm update needs testing first. GPU04 is available for testing.
  • Jasmin: Is there a repo available? Hopko will check.
  • None
  • none
  • none

When: 13-12-2018 11:00, Muller
Present: Arno, Matthijs (Slack), Reinoud, Jasmin, Henk, Teun (coordinator)

  1. Central: Always start the stop-day with NIS updates. NIS is needed for Slurm and for the lexars (ssh tunnel).
  2. CEP3: Slurm needs working accounts when slurmctld starts; otherwise reservations are lost. Jasmin suggested replacing the current password hack with a solution based on ssh keys. To be implemented before the next stop-day.
  3. CEP4: mgmt01 node is now on CentOS 7.5
  4. Lexars: The lexars came up while NIS was down, which broke the ssh tunnel.
  5. LCU family: Replacement of the Rubidium in CN caused confusion, and it took quite some time to figure out what was happening.
  6. LCU ILT: Not available on the stop-day. They were rebooted the next Monday. In future: plan many months in advance, if possible every other stop-day. For the software roll-out the ILT stations always need to be available.
  7. WinCC: We need to set up a test system first.
  8. About Novell IDM: What is the timescale for replacement? The systems can’t be updated anymore.
  9. Ldb003: The Intel NIC did not work. We used the internal NICs instead.
  10. Lofarlta01 not done. No time left.
  11. Dragnet: Are Dragnet users aware of the cable change? SOS skipped the tests, so the Dragnet team needs to run them. Matthijs will inform dragnet@astron.nl.
  12. Aartfaac-lcu was upgraded and rebooted the next Wednesday. It had over 500 packages to be installed. It went fine in the end.
  13. Network: The reload at 11:00 surprised us. Teun should have warned his colleagues a few minutes earlier.
  14. The Zabbix server crashed during upgrade. Reinstall was needed and that took some hours. Cause unknown.
  15. Dwingeloo systems were also updated & rebooted using spacewalk.
  16. Scu001 has no NFS mount. Remove it from SDOS checks.
  17. Triggered observation test failed. There seems to be a bug in the script.
  18. CEP3 (Matthijs): Should SOS inform users? Yes, the coordinator will report when accounts are back and Slurm is up. SOS then needs to check and inform users. The SOS checks are still being defined.
  19. Network overhaul: The validation run ahead of the stop-day was not done due to miscommunication. Please wait for acknowledgement.
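
Points 1, 2 and 4 all come down to start order: NIS has to be up before Slurm and the lexars. A minimal sketch of how that order could be written down and checked, assuming Python 3.9+ for graphlib. The service names and the DEPENDS_ON map are hypothetical; only the two NIS edges are taken from the notes above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up first.
# Only the NIS -> Slurm and NIS -> lexars edges come from the notes above.
DEPENDS_ON = {
    "nis": set(),
    "slurm": {"nis"},
    "lexars": {"nis"},
}


def start_order(depends_on):
    """Return a start order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def started_too_early(actual_order, depends_on):
    """List (service, dependency) pairs where the service started before its dependency."""
    position = {name: i for i, name in enumerate(actual_order)}
    return [
        (service, dep)
        for service, deps in depends_on.items()
        for dep in sorted(deps)
        if position[service] < position[dep]
    ]
```

Calling started_too_early(["lexars", "nis", "slurm"], DEPENDS_ON) flags the pair ("lexars", "nis"), which matches the situation described in point 4.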
  • Last modified: 2018-12-18 12:20
  • by grit