Stop-day activities February 5, 2019


Coordinator Teun Grit roadmin@astron.nl
Software Support softwaresupport@astron.nl
Science, Operations and Support Sarrvesh Sridhar sos@astron.nl
Observer Henk Mulder observer@astron.nl

Description of stopday procedures
LOFAR Schedule cycle 11

Cobalt

  • ✔ Reboots (Hopko)
  • ✔ COBALT2: Connect 10GbE ports to RD-0 and RD-1. (Arjen)

CEP3

  • ✔ Block access at 08:00 (Teun)
  • ✔ Reboots (Kees)
  • ✔ Update CentOS7.3 except Python3

CEP4

LEXARS

  • ✔ Reboots

LCU

  • None

Portals

  • ✔ Update & reboot (lcs116 updated, no reboot)

Central Services lcs020 .. lcs030

  • OS upgrade and reboot
  • ✔ Move memory DIMMs from lcs107/108 to lcs119 (4 x 8GB)

Other Central Services

  • OS upgrade and reboot

LTA

  • ✔ Update & reboot

Aartfaac

  • No update & reboots on their request

Core switches

  • None
  • none

MAC/SAS

  • none

CEP3

  • ✔ Wsclean & IDG
  • ✔ LoSoTo
  • ✔ RMextract

LCU

  • None

CEP4

  • None

Aartfaac

  • None

COBALT

  • none

LTA

  • none

DWG Lofar Systems

Review meeting

Present: Reinoud Bokhorst, Henk Mulder, Hopko Meijering (Slack), Teun Grit. By email: Thomas Franzen on behalf of SOS

  1. Hopko: The CEP4 cpu30 warning disappeared after the reboot.
  2. Reinoud: The Cobalt1 test script complained about kis001. The Cobalt1 checks were not complete. Communication could be better. Solution: the coordinator needs to ask again in Slack.
  3. Reinoud: The CEP4 Slurm update went okay. We discovered that the Docker containers also had an old version of the Slurm client. We solved this by mounting the client from outside the containers.
  4. Teun: A new script finds all CEP3 users of the past 2 months and looks up their email addresses in NIS. This script must be used to warn CEP3 users a week ahead of upcoming stop-days. It can also be used to inform those users when the stop-day has finished. The script, find_email_addr.sh, is hosted on lhdhead.
  5. Reinoud: Some supervisor daemons were not stopped by Software Support.
  6. Teun: It would be good to have an IB status check in Zabbix for Cobalt2.
  7. (The following items are from the email by Thomas.)
  8. All items on the SOS checklist were completed except for 'Check data recorded on DRAGNET'. The test observation ran but no data were recorded on drg10. This is because of missing software. There is a ticket addressing this issue, assigned to Jorrit: https://support.astron.nl/jira/projects/ROHD/queues/custom/65/ROHD-759 I don’t understand why the status of this ticket is ‘done’. I will remind Jorrit.
  9. In the future, the stop day coordinator should be formally informed when SOS has completed the checklist. An e-mail should be sent by SOS to the stop day coordinator, with SDOS and ROAdmin CC'd to ensure that everyone is kept in the loop.
  10. The SLURM version mismatch was not detected during the stop day because 'test pre-processing pipeline' was not on the SOS stop day checklist. This has now been added to the checklist.
  11. My understanding is that ROAdmin will keep a CEP3 users e-mail list up-to-date. They will inform the CEP3 users when the system will be unavailable during stop days and software roll-outs, and also when the system is back online again. Let me know if I have misunderstood this. Comment by Teun: It is the responsibility of the stop-day coordinator that the CEP3 users are well informed. He will create the list of email addresses and send it to the SOS person on duty for that stop-day and this SOS person will send the email out (cc to coordinator). The coordinator verifies this.
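
The user-notification step from item 4 can be sketched as follows. This is only an illustration of the idea (select users seen in the past two months, then look up their addresses), not the contents of find_email_addr.sh. The function names, the 61-day window, the sample logins and the example.org addresses are all assumptions; in practice the login records and the address lookup would come from CEP3 and NIS.

```python
from datetime import date, timedelta


def recent_users(logins, today, days=61):
    """Return the set of users with at least one login in the last `days` days.

    `logins` is an iterable of (username, login_date) pairs. How these are
    collected on CEP3 is not specified here; this format is an assumption.
    """
    cutoff = today - timedelta(days=days)
    return {user for user, when in logins if cutoff <= when <= today}


def addresses_for(users, directory):
    """Split users into (sorted addresses found, sorted users without one).

    `directory` maps username -> email address and stands in for the
    NIS lookup (hypothetical representation).
    """
    found = sorted({directory[u] for u in users if u in directory})
    missing = sorted(u for u in users if u not in directory)
    return found, missing


# Hypothetical sample data, for illustration only.
sample_logins = [
    ("alice", date(2019, 1, 20)),
    ("bob", date(2018, 10, 2)),
    ("carol", date(2019, 2, 1)),
    ("alice", date(2018, 12, 24)),
]
sample_directory = {
    "alice": "alice@example.org",
    "bob": "bob@example.org",
}

users = recent_users(sample_logins, today=date(2019, 2, 5))
found, missing = addresses_for(users, sample_directory)
print("notify:", ", ".join(found))
print("no address on record:", ", ".join(missing))
```

Reporting the users without an address separately lets the coordinator follow up by hand before the list is passed to the SOS person on duty.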
  • Last modified: 2019-02-19 13:19
  • by Teun Grit