Stop-day activities February 5, 2019


Coordinator Teun Grit roadmin@astron.nl
Software Support softwaresupport@astron.nl
Science, Operations and Support Sarrvesh Sridhar sos@astron.nl
Observer Henk Mulder observer@astron.nl

Description of stopday procedures
LOFAR Schedule cycle 11

Systems

Cobalt

CEP3

CEP4

LEXARS

LCU

Portals

Central Services lcs020 .. lcs030

Other Central Services

LTA

Aartfaac

Core switches

Software updates

MAC/SAS

CEP3

LCU

CEP4

Aartfaac

COBALT

LTA

In the field

DWG Lofar Systems

Review meeting

Present: Reinoud Bokhorst, Henk Mulder, Hopko Meijering (Slack), Teun Grit. By email: Thomas Franzen on behalf of SOS

  1. Hopko: The CEP4 cpu30 warning disappeared after the reboot.
  2. Reinoud: The Cobalt1 test script complained about kis001, and the Cobalt1 checks were not completed. Communication could be better. Solution: the coordinator needs to ask again in Slack.
  3. Reinoud: The CEP4 Slurm update went okay. We discovered that the Docker containers also had an old version of the Slurm client. We solved this by mounting the client from outside the container.
  4. Teun: A new script finds all CEP3 users of the past 2 months and looks up their email addresses in NIS. This script must be used to warn CEP3 users a week ahead of upcoming stop-days. It can also be used to inform those users when the stop-day finishes. The script find_email_addr.sh is hosted on lhdhead.
  5. Reinoud: Some supervisor daemons were not stopped by Software Support.
  6. Teun: It would be good to have an IB status check in Zabbix for Cobalt2.
  7. (The following items are from the email by Thomas.)
  8. All items on the SOS checklist were completed except for 'Check data recorded on DRAGNET'. The test observation ran but no data were recorded on drg10. This is because of missing software. There is a ticket addressing this issue, assigned to Jorrit: https://support.astron.nl/jira/projects/ROHD/queues/custom/65/ROHD-759 I don’t understand why the status of this ticket is ‘done’. I will remind Jorrit.
  9. In the future, the stop day coordinator should be formally informed when SOS has completed the checklist. An e-mail should be sent by SOS to the stop day coordinator, with SDOS and ROAdmin CC'd to ensure that everyone is kept in the loop.
  10. The SLURM version mismatch was not detected during the stop day because 'test pre-processing pipeline' was not on the SOS stop day checklist. It has now been added to the checklist.
  11. My understanding is that ROAdmin will keep a CEP3 users e-mail list up to date. They will inform the CEP3 users when the system will be unavailable during stop days and software roll-outs, and also when the system is back online again. Let me know if I have misunderstood this. Comment by Teun: The stop-day coordinator is responsible for ensuring that the CEP3 users are well informed. The coordinator creates the list of email addresses and sends it to the SOS person on duty for that stop-day, and this SOS person sends out the email (cc to the coordinator). The coordinator verifies this.
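
Item 4 describes a script that collects recent CEP3 users and looks up their email addresses. The sketch below illustrates that idea only; it is not the contents of find_email_addr.sh, which is not shown on this page. It assumes login records are already available as (username, timestamp) pairs, it approximates "the past 2 months" as 60 days, and it uses a plain dictionary as a stand-in for the NIS lookup. The function names recent_users and email_addresses are hypothetical.

```python
from datetime import datetime, timedelta


def recent_users(logins, now, days=60):
    """Return a sorted list of usernames that logged in within the last `days` days.

    `logins` is an iterable of (username, datetime) pairs. How these are
    obtained on CEP3 is an assumption left outside this sketch.
    """
    cutoff = now - timedelta(days=days)
    return sorted({user for user, when in logins if when >= cutoff})


def email_addresses(users, directory):
    """Split `users` into those with a known address and those without.

    `directory` maps username -> email address and stands in for NIS here.
    Returns (found, missing): a dict of addresses and a list of usernames
    that need manual follow-up.
    """
    found = {user: directory[user] for user in users if user in directory}
    missing = [user for user in users if user not in directory]
    return found, missing
```

Returning the users without an address separately lets the coordinator see who would otherwise not receive the stop-day announcement.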