public:stopdayactivities_5feb2019, current revision 2019-02-19 13:19 ([Review meeting], grit); previous revision 2019-01-31 10:37 ([LEXARS], Reinoud Bokhorst)
====== Stop-day activities
\\
^ Coordinator | Teun Grit | roadmin@astron.nl |
^ Software Support | | softwaresupport@astron.nl |
^ Science, Operations and Support | Sarrvesh Sridhar | sos@astron.nl |
^ Observer | Henk Mulder | observer@astron.nl |
==== Cobalt ====
  * ✔ Reboots (Hopko)
  * ✔ COBALT2: Connect 10GbE ports to RD-0 and RD-1. (Arjen)
==== CEP3 ====
  * ✔ Block access at 08:00 (Teun)
  * ✔ Reboots
  * ✔ Update CentOS 7.3, except Python 3
==== CEP4 ====
  * ✔ Reboots
  * ✔ [[Slurm upgrade to v17.02.2]] (Reinoud/
==== LEXARS ====
  * ✔ Reboots
==== LCU ====
==== Portals ====
  * ✔ Update & reboot
==== Central Services lcs020 .. lcs030 ====
  * ✔ OS upgrade and reboot
  * ✔ Move memory DIMMs from lcs107/108 to lcs119 (4 x 8 GB)
==== Other Central Services ====
  * ✔ OS upgrade and reboot
==== LTA ====
  * ✔ Update & reboot
==== Aartfaac ====
  * No updates or reboots, at their request
==== CEP3 ====
  * ✔ WSClean & IDG
  * ✔ LoSoTo
  * ✔ RMextract
==== LCU ====
==== Review meeting ====
Present: Reinoud Bokhorst, Henk Mulder, Hopko Meijering (via Slack), Teun Grit. By email: Thomas Franzen, on behalf of SOS.

  - Hopko: the CEP4 cpu30 warning disappeared after the reboot.
  - Reinoud: the Cobalt1 test script complained about kis001. The Cobalt1 checks were not completed. Communication could be better. Solution: the coordinator needs to ask again in Slack.
  - Reinoud: the CEP4 Slurm update went okay. We discovered that the Docker containers also contained an old version of the Slurm client. We solved this by mounting the client from outside the containers.
  - Teun: a new script discovers all CEP3 users of the past 2 months and looks up their email addresses in NIS. This script must be used to warn CEP3 users a week ahead of upcoming stop-days. It can also be used to inform those users when the stop-day has finished. The script ''
  - Reinoud: some supervisor daemons were not stopped by Software Support.
  - Teun: it would be good to have an IB (InfiniBand) status check in Zabbix for Cobalt2.
  - (The following points are from Thomas's email.)
    - All items on the SOS checklist were completed except for 'Check data recorded on DRAGNET'.
    - In the future, the stop day coordinator should be formally informed when SOS has completed the checklist. An e-mail should be sent by SOS to the stop day coordinator,
    - The SLURM version mismatch was not detected during the stop day because 'test pre-processing pipeline' was not on the SOS stop day checklist. This has now been added to the checklist.
    - My understanding is that ROAdmin will keep a CEP3 user e-mail list up to date. They will inform the CEP3 users when the system will be unavailable during stop days and software roll-outs, and also when the system is back online again. Let me know if I have misunderstood this. Comment by Teun: the stop-day coordinator is responsible for keeping the CEP3 users well informed. The coordinator creates the list of email addresses and sends it to the SOS person on duty for that stop-day, who sends the email out (cc to the coordinator). The coordinator verifies this.
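The CEP3 user-notification script discussed in the review is not reproduced on this page. As a rough illustration only, the steps described for it (collect the users active in the past 2 months, look up their addresses in NIS, send the warning a week before the stop-day) could be sketched in Python as below. Everything here is hypothetical: the function names, the assumption that logins are available as (user, date) pairs, the 60-day window standing in for "2 months", and the assumption that the NIS map holding the addresses is dumped as `key value` lines (the form printed by `ypcat -k`). Which NIS map is actually used on CEP3 is not stated.

```python
from datetime import date, timedelta


def recent_users(logins, today, days=60):
    """Users with at least one login in the last `days` days.

    `logins` is an iterable of (username, date) pairs. How these are
    collected on CEP3 is an assumption of this sketch.
    """
    cutoff = today - timedelta(days=days)
    return {user for user, day in logins if day >= cutoff}


def parse_nis_map(text):
    """Parse `ypcat -k <map>` style output ("key value" per line) into a dict."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:  # skip blank lines and keys without a value
            mapping[parts[0]] = parts[1].strip()
    return mapping


def recipients(logins, nis_text, today, days=60):
    """Sorted email addresses of recent users; users without an NIS entry are skipped."""
    emails = parse_nis_map(nis_text)
    return sorted(emails[u] for u in recent_users(logins, today, days) if u in emails)


def warning_date(stop_day, days_ahead=7):
    """Day on which to send the stop-day warning (one week ahead by default)."""
    return stop_day - timedelta(days=days_ahead)


# Hypothetical example data.
logins = [("alice", date(2019, 1, 20)), ("bob", date(2018, 10, 1)), ("carol", date(2019, 2, 1))]
nis = "alice alice@example.org\nbob bob@example.org\n"
print(recipients(logins, nis, today=date(2019, 2, 5)))  # ['alice@example.org']
print(warning_date(date(2019, 2, 5)))  # 2019-01-29
```

In the example, bob is dropped because his last login falls outside the window, and carol is dropped because she has no NIS entry; a real script would probably want to report such users to the coordinator rather than skip them silently.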
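The Slurm client mismatch between CEP4 and its Docker containers could, in principle, be caught by comparing the version reported by `sinfo --version` (which prints a line such as `slurm 17.02.2`) on the host and inside each container. A minimal Python sketch, assuming such a check is wanted: the helper names are hypothetical, the `docker exec <container>` prefix is only an example of how to reach a running container, and the older version in the example is made up for illustration.

```python
import re
import subprocess

# `sinfo --version` prints a line such as "slurm 17.02.2".
VERSION_RE = re.compile(r"slurm\s+(\d+(?:\.\d+)+)")


def parse_slurm_version(output):
    """Extract the version number from `sinfo --version` output."""
    match = VERSION_RE.search(output)
    if match is None:
        raise ValueError(f"no Slurm version found in {output!r}")
    return match.group(1)


def slurm_version(cmd_prefix=()):
    """Run `sinfo --version`, optionally behind a prefix such as
    ("docker", "exec", "<container>") to query a running container."""
    result = subprocess.run(
        [*cmd_prefix, "sinfo", "--version"],
        capture_output=True, text=True, check=True,
    )
    return parse_slurm_version(result.stdout)


def find_mismatches(expected, observed):
    """Return {location: version} for every location not on `expected`."""
    return {where: version for where, version in observed.items() if version != expected}


# Example with canned output; "16.05.9" is a made-up older client version.
observed = {
    "host": parse_slurm_version("slurm 17.02.2\n"),
    "container": parse_slurm_version("slurm 16.05.9\n"),
}
print(find_mismatches("17.02.2", observed))  # {'container': '16.05.9'}
```

Parsing is kept separate from running the command so the comparison can be tested without Slurm installed; on a real node, `slurm_version()` and `slurm_version(("docker", "exec", name))` would supply the `observed` values.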