
Stop-day activities February 5, 2019


Coordinator Teun Grit roadmin@astron.nl
Software Support softwaresupport@astron.nl
Science, Operations and Support Sarrvesh Sridhar sos@astron.nl
Observer Henk Mulder observer@astron.nl

Description of stopday procedures
LOFAR Schedule cycle 11

  • ✔ Reboots (Hopko)
  • ✔ COBALT2: Connect 10GbE ports to RD-0 and RD-1. (Arjen)
  • ✔ Block access at 08:00 (Teun)
  • ✔ Reboots (Kees)
  • ✔ Update CentOS7.3 except Python3
  • ✔ Reboots
  • None
  • ✔ Update & reboot (lcs116 updated, no reboot)
  • OS upgrade and reboot
  • ✔ Move memory DIMMs from lcs107/108 to lcs119 (4 x 8GB)
  • OS upgrade and reboot
  • ✔ Update & reboot
  • No update or reboots, at their request
  • None
  • none
  • none
  • ✔ Wsclean & IDG
  • ✔ LoSoTo
  • ✔ RMextract
  • None
  • None
  • None
  • none
  • none

Present: Reinoud Bokhorst, Henk Mulder, Hopko Meijering (Slack), Teun Grit. By email: Thomas Franzen on behalf of SOS

  1. Hopko: The CEP4 cpu30 warning disappeared after the reboot.
  2. Reinoud: The Cobalt1 test script complained about kis001, and the Cobalt1 checks were not complete. Communication could be better. Solution: the coordinator needs to ask again in Slack.
  3. Reinoud: The CEP4 Slurm update went okay. We discovered that the Docker containers also had an old version of the Slurm client. We solved this by mounting the client into the containers from outside.
  4. Teun: A new script finds all CEP3 users of the past 2 months and looks up their email addresses in NIS. This script must be used to warn CEP3 users a week ahead of upcoming stop-days. It can also be used to inform those users when the stop-day finishes. The script, find_email_addr.sh, is hosted on lhdhead.
  5. Reinoud: Some supervisor daemons were not stopped by Software Support.
  6. Teun: It would be good to have an IB status check in Zabbix for Cobalt2.
  7. (The following items are from the email by Thomas.)
  8. All items on the SOS checklist were completed except for 'Check data recorded on DRAGNET'. The test observation ran, but no data were recorded on drg10 because of missing software. There is a ticket addressing this issue, assigned to Jorrit: https://support.astron.nl/jira/projects/ROHD/queues/custom/65/ROHD-759 I don't understand why the status of this ticket is 'done'. I will remind Jorrit.
  9. In the future, the stop day coordinator should be formally informed when SOS has completed the checklist. An e-mail should be sent by SOS to the stop day coordinator, with SDOS and ROAdmin CC'd to ensure that everyone is kept in the loop.
  10. The SLURM version mismatch was not detected during the stop day because 'test pre-processing pipeline' was not on the SOS stop day checklist. This has now been added to the checklist.
  11. My understanding is that ROAdmin will keep a CEP3 users e-mail list up to date. They will inform the CEP3 users when the system will be unavailable during stop days and software roll-outs, and also when the system is back online again. Let me know if I have misunderstood this. Comment by Teun: It is the responsibility of the stop-day coordinator to make sure that the CEP3 users are well informed. He will create the list of email addresses and send it to the SOS person on duty for that stop-day, and this SOS person will send the email out (cc to the coordinator). The coordinator verifies this.
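The user lookup described in item 4 could be sketched roughly as below. This is not the contents of find_email_addr.sh, which is not reproduced on this page; it only illustrates the two steps (select users active in the past 2 months, then look up their addresses). The function names, the 60-day window, the (username, login time) input format and the username-to-address directory are all assumptions made for illustration. How the real script reads login history or queries NIS is not shown here.

```python
from datetime import datetime, timedelta


def recent_users(logins, now, days=60):
    """Return sorted usernames with at least one login in the last `days` days.

    `logins` is an iterable of (username, login_time) pairs. This input
    format is an assumption; the real script's data source is not documented.
    """
    cutoff = now - timedelta(days=days)
    return sorted({user for user, when in logins if when >= cutoff})


def lookup_emails(users, directory):
    """Split users into those with a known address and those without.

    `directory` is a hypothetical username -> email mapping standing in
    for whatever NIS lookup the real script performs.
    """
    found = {user: directory[user] for user in users if user in directory}
    missing = [user for user in users if user not in directory]
    return found, missing


if __name__ == "__main__":
    # Made-up example data, not real CEP3 users.
    now = datetime(2019, 2, 5)
    logins = [
        ("alice", datetime(2019, 1, 20)),
        ("bob", datetime(2018, 10, 1)),
        ("carol", datetime(2019, 1, 3)),
    ]
    directory = {"alice": "alice@example.org", "bob": "bob@example.org"}

    users = recent_users(logins, now)
    found, missing = lookup_emails(users, directory)
    print("Notify:", found)
    print("No address found for:", missing)
```

Reporting the users without an address separately lets the coordinator follow up by hand before the list is passed to the SOS person on duty.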
  • Last modified: 2019-02-19 13:19
  • by grit