public:stopdayactivities_4dec2018

Differences

This shows you the differences between two versions of the page.

public:stopdayactivities_4dec2018 [2018-12-05 10:35] – [Aartfaac] grit
public:stopdayactivities_4dec2018 [2018-12-18 12:20] (current) – [CEP4] grit
Line 33:
  
   * √ Switch IPoIB to connected mode on head and gpu nodes, see https://support.astron.nl/jira/browse/ROADMT-186
-  * √ The Slurm disk performance tests are planned for after the stopday. (Volume now is 75%)
+  * √ The Lustre disk performance tests are planned for after the stopday. (Volume now is 75%)
   * mgmt01.cep4 to CentOS 7.5
   * Robinhood tests
Line 141:
 ==== Review meeting ====
  
-  *
 +When: 13-12-2018 11:00 Muller
 +Present: Arno, Matthijs (Slack), Reinoud, Jasmin, Henk, Teun (coordinator)
 + 
 + 
 +  - Central: Always start the stop-day with NIS updates. NIS is needed for Slurm and the Lexars (SSH tunnel).
 +  - CEP3: Slurm needs working accounts when slurmctld starts; otherwise reservations will be lost! Jasmin suggested setting up an alternative to the current password hack, using SSH keys. To be implemented before the next stop-day.
 +  - CEP4:  mgmt01 node is now on CentOS 7.5 
 +  - Lexars: The Lexars came up while NIS was down, which broke the SSH tunnel.
 +  - LCU family: Replacement of the Rubidium in CN caused confusion, and it took quite some time to figure out what was happening.
 +  - LCU ILT: Not available on the stop-day. They were rebooted the next Monday. In future: plan many months in advance, if possible every other stop-day. For the software roll-out, the ILT stations always need to be available.
 +  - WinCC: We need to set up a test system first.
 +  - About Novell IDM: What is the timescale for replacement? The systems can’t be updated anymore. 
 +  - Ldb003: The Intel NIC did not work. We used the internal NICs instead.
 +  - Lofarlta01: Not done. No time left.
 +  - Dragnet: Are Dragnet users aware of the cable change? SOS has skipped the tests, so the Dragnet team needs to run them. Matthijs will inform dragnet@astron.nl.
 +  - Aartfaac-lcu was upgraded and rebooted the next Wednesday. It had over 500 packages to install. It went fine in the end.
 +  - Network:  The reload at 11:00 surprised us. Teun should have warned his colleagues a few minutes earlier.  
 +  - The Zabbix server crashed during the upgrade. A reinstall was needed, which took some hours. Cause unknown.
 +  - Dwingeloo systems were also updated and rebooted using Spacewalk.
 +  - Scu001 has no NFS mount. Remove it from SDOS checks. 
 +  - The triggered observation test failed. There seems to be a bug in the script.
 +  - Matthijs: Cep3. Should SOS inform users? Yes, the coordinator will report when accounts are back and Slurm is up. SOS needs to check and inform users thereafter. The SOS checks are still being defined. 
 +  - Network overhaul: The validation run ahead of the stop-day was not done due to miscommunication. Please wait for acknowledgement.
  • Last modified: 2018-12-18 12:20
  • by grit