===== Access and Login =====
To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
The easiest way is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
If needed, you can also provide the RO Admins with your (e.g. home) IP(s) to add to the LOFAR portal white list.

Once you have an account, ssh to hostname ''dragnet.control.lofar'' or, easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu'') (or tunnel through it):
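A minimal sketch (assuming your username is the same on the portal and on DRAGNET; the one-step variant needs the ''-J'' jump-host option of OpenSSH 7.3+):
  $ ssh yourusername@portal.lofar.eu    # log in on the portal first
  $ ssh dragnet                         # then hop to the DRAGNET head node
  $ ssh -J yourusername@portal.lofar.eu yourusername@dragnet.control.lofar    # or tunnel through in one step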
Type ''module help'' for a list of ''module'' commands.

List of available modules (July 2017):
  $ module avail
  ---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
  aoflagger/2.8.0    casa/4.7        casacore/2.0.3    casarest/current  cuda/current      lofar/2.20.0   lofardal/2.5.0      srm/2.6.28     wsclean/current
  aoflagger/2.9.0    casa/5.0        casacore/2.1.0    cuda/7.0          karma/1.7.25      lofar/2.21.1   lofardal/current    wsclean/1.12
  aoflagger/current  casa/current    casacore/current  cuda/7.5          local-user-tools  lofar/2.21.5   mpi/mpich-x86_64    wsclean/2.2.1
  casa/4.6           casacore/2.0.1  casarest/1.4.1    cuda/8.0          lofar/2.17.5      lofar/current  mpi/openmpi-x86_64  wsclean/2.4

Add the latest lofar module to your environment:
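A minimal sketch (assuming the ''lofar/current'' module tracks the latest release, as in the listing above):
  $ module add lofar/current    # or 'module add lofar' to load the default version
  $ module list                 # verify which modules are now loaded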
To download data (e.g. from the LTA webdav at SARA), go through the web proxy on ''lexar004'':
  http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...

//However//, at the moment you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
Put that username and password in a ''.wgetrc'' file in your home directory:
  proxy-user=yourusername
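  # assumed companion setting: the matching password; wgetrc treats '-' and '_'
  # in command names the same (see the "Wgetrc Commands" section of the wget manual)
  proxy-password=yourpassword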
To resume suspended job(s) by job ID:
  $ scontrol resume 100
  $ scontrol resume [1000,2000]

==== SLURM Troubleshooting ====
== "Undraining" nodes ==

If you expect that there should be enough resources but SLURM job submission fails, some nodes may be in the "drain" state. You can check this by running ''sinfo''. You could see something like this, where nodes drg06 and drg08 are in the drain state:

  $ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
  workers*     up   infinite      2  drain  drg[06,08]
  workers*     up   infinite      1    mix  drg01
  workers*     up   infinite     21   idle  dragproc,drg[02-05,07,09-23]
  head         up   infinite      1   idle  dragnet

To "undrain" e.g. drg08, you can do:
  $ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
  $ scontrol update NodeName=drg08 State=RESUME
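Afterwards, a quick check that the node left the drain state (''sinfo -n'' limits the report to the named node(s)):
  $ sinfo -n drg08    # STATE should now read idle (or mix/alloc once jobs run on it)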