public:stopdayactivities_5jun2018

Differences

This shows you the differences between two versions of the page.

public:stopdayactivities_5jun2018 [2018-05-14 07:53] – [Stop-day activities June 5-6, 2018] Reinoud Bokhorst public:stopdayactivities_5jun2018 [2018-06-05 11:52] (current) – [Stop-day activities June 5-6, 2018] Reinoud Bokhorst
Line 4: Line 4:
  
 ^ Coordinator | Reinoud Bokhorst | roadmin@astron.nl | ^ Coordinator | Reinoud Bokhorst | roadmin@astron.nl |
-^ Software Support | | softwaresupport@astron.nl | +^ Software Support | Arno Schoenmakers | softwaresupport@astron.nl | 
-^ Science, Operations and Support |  | sos@astron.nl | +^ Science, Operations and Support | Pietro Zucca | sos@astron.nl | 
-^ Observer | | observer@astron.nl | +^ Observer | Henk Mulder | observer@astron.nl |
  
  
-[[engineering:stop_day_procedures|More stopday details]] +⇒ [[engineering:stop_day_procedures|Description of stopday procedures]]\\ 
- +⇒ [[https://docs.google.com/spreadsheets/d/e/2PACX-1vSEbxNss-nmOofKDXJRmgACwMDB9zeekBLRl39krswsVGIigfvzD_EdnlKJ_2TF-IGgoX2IXvc2YlXL/pubhtml|LOFAR Schedule cycle 10]]\\ 
-[[https://docs.google.com/spreadsheets/d/e/2PACX-1vSEbxNss-nmOofKDXJRmgACwMDB9zeekBLRl39krswsVGIigfvzD_EdnlKJ_2TF-IGgoX2IXvc2YlXL/pubhtml|LOFAR Schedule cycle 10]] +⇒ The next stopday is scheduled for August 7.
 ===== Systems ===== ===== Systems =====
  
Line 19: Line 18:
 ==== Cobalt ==== ==== Cobalt ====
  
-  * ✔ Reboots and idrac reboots. (Hopko/Robin) +  * ✔ Reboots and idrac reboots. (Hopko) 
-  * ✔ CBM010 will be present before the stopday+ 
 ==== CEP3 ==== ==== CEP3 ====
  
   * ✔ Block access at 08:00 (Teun)   * ✔ Block access at 08:00 (Teun)
-  * ✔ Mount DAC cables to headnodes to support floating IP (Hopko/Robin) +  * ✔ All nodes: file system check and reboot. (Kees)
-  * ✔ All nodes: file system check and reboot. (Hopko, Robin) +
-  * ✔ Was broken: Check/debug persistence of Slurm reservations (Reinoud)  **For the review**: We should have a backup in place! +
-  * ✔ NFS mounts for cep3, from all the lof-nodes are using control!+
  
  
 ==== CEP4 ==== ==== CEP4 ====
  
-  * ✔ Connect DAC cable (Hopko)+  * ✔ Reboot (Hopko) 
 +  * ✔ Recreate Docker thinpools on CPU nodes 
 +  * ✔ Recabling of Infiniband, details in Jira ticket 
 +  * Performance tests after recabling
  
 ==== LEXARS ==== ==== LEXARS ====
  
-  *  ✔ lexar003 reboot using XCAT (148 days up) (Was done by Reinoud 9-4-2018)+  *  ✔ Reboot (Hopko/Robin) 
  
 ==== LCU ==== ==== LCU ====
  
-  *  ✔ Reboot of all Dutch LCU's (teun) (ILT stations in local mode) **For the review**: We discovered that all LCU's have a fixed NFS mount on /home. This should be an automount.+  *  
  
 ==== Central Services ==== ==== Central Services ====
    
-  * ✔ Update portals to CentOS/KVM +  * ✔ Restart qpidd@ccu001 (ref. https://support.astron.nl/jira/browse/ROADMT-99) 
-  * ✔ Update and reboot lcs020 .. lcs30 +  * ✔ Test DMZ KVM Failover  (DMZ KVM Hypervisor hosts DMZ services (portal,dns server,smtp,proxy etc)) 
-  * ✔ Remove sas001 and sas099 (powered off, disconnect cables) +  * ✔ OS upgrade and reboot
-  * ✔ Update & reboot NFS server lcs115 +
-  * ✔ Almost all NFS mounts on the lcs115 NFS server are over the control network. Only a few correctly use the offline network: only lexar003, lexar004, and lhd002. We should force CEP3 to use off-line! +
-  * ✔ Mainly MAC/SAS, LCU's and Aartfaac machines use NFS over control VLAN. They don't have an off-line or on-line network connection. +
-  * Check resolv.conf settings; see https://support.astron.nl/lofar_issuetracker/issues/10448+
 ==== LTA ==== ==== LTA ====
  
-  * ✔ Update and reboot when required (Reinoud) +  * ✔ Update and reboot (Reinoud) 
 +  * ✔ Migration of Oracle DB to new hardware (Andrey Tsyganov)
 ==== Aartfaac ==== ==== Aartfaac ====
  
-  * ✘ Check for broken disks **Fail**: ais007 has a degraded RAID1  !! +  * ✔ Check for broken disks **Fail**: ais007 had a degraded RAID1, but a controller firmware update helped. 
-  * **For review**: Add Jasmin & Reinoud to all nodes as admin+ 
 + 
 ==== Core switches ==== ==== Core switches ====
  
-  * none (probably June)+  * ✔ Warm reset PD0, RD0 and RD1 (Arjen)
  
  
-==== Communication issues ==== 
  
-**For review**: At the end of the 1st day software support needs to report status to coordinator 
 ===== Software updates ===== ===== Software updates =====
  
 ==== MoM and related ==== ==== MoM and related ====
  
-  * ?+  * none
  
 ==== MAC/SAS ==== ==== MAC/SAS ====
  
-  * ?+  * none
  
 ==== CEP3 ==== ==== CEP3 ====
  
-  * ✔ Reboot / fs checks +  * none
-  * ✔ Make AOFlagger 2.10 the default version (already installed) +
-  * ✔ Make LOFAR-Release-3_0_14 the default version (linked against AOFlagger 2.10) +
-  * ✔ Make WSClean 2.5 the default version (already installed)+
  
-==== CEP4 ====+==== LCU ====
  
-  * ?+  * synchronize Python packages, see list in ticket 
 +  * ✔ umask change for foreign stations
  
 +==== CEP4 ====
 +
 +  * Rollout Docker images
 +  * ✘ SLURM upgrade  (postponed)
 ==== Aartfaac ==== ==== Aartfaac ====
  
Line 102: Line 101:
 ===== In the field ===== ===== In the field =====
  
-  * none +  * 
- +
- +
-===== External =====+
  
-  * ? 
  
-==== Next stopday ==== 
  
-The next stopday is TBD 
  • Last modified: 2018-05-14 07:53
  • by Reinoud Bokhorst