===== Access and Login =====
To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
Easiest is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
You can also provide the RO Sysadmins your (e.g. home) IP(s) to add to the LOFAR portal white list if needed.

Having an account, ssh to hostname ''dragnet.control.lofar'' or easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu''), or tunnel through it:
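For example, a minimal sketch (assuming your username is the same on both machines; the one-liner needs a reasonably recent OpenSSH with ''-J''/ProxyJump support):
$ ssh yourusername@portal.lofar.eu     # first hop: LOFAR portal
$ ssh dragnet                          # second hop: DRAGNET head node

Or in one go from your own machine:
$ ssh -J yourusername@portal.lofar.eu yourusername@dragnet.control.lofar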
Type ''module help'' for a list of ''module'' commands.

List of available modules (July 2017):
$ module avail

---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
aoflagger/2.8.0 casa/4.7 casacore/2.0.3 casarest/current cuda/current lofar/2.20.0 lofardal/2.5.0 srm/2.6.28 wsclean/current
aoflagger/2.9.0 casa/5.0 casacore/2.1.0 cuda/7.0 karma/1.7.25 lofar/2.21.1 lofardal/current wsclean/1.12
aoflagger/current casa/current casacore/current cuda/7.5 local-user-tools lofar/2.21.5 mpi/mpich-x86_64 wsclean/2.2.1
casa/4.6 casacore/2.0.1 casarest/1.4.1 cuda/8.0 lofar/2.17.5 lofar/current mpi/openmpi-x86_64 wsclean/2.4
Add latest lofar module to your env:
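For example, a minimal sketch (''lofar/current'' is one of the modules listed above; pin a specific version instead if you need reproducibility):
$ module add lofar/current
$ module list          # verify which modules are now loaded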
http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...

//However//, at the moment you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
Put that username and password in a ''.wgetrc'' file in your home directory:
proxy-user=yourusername
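The password goes into the same file; a hedged sketch (''proxy-password'' is the standard wget name for this setting, and the file is best kept private since it holds a plain-text password):
proxy-password=yourpassword
$ chmod 600 ~/.wgetrc          # credentials should be readable only by you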
Users can resume their job(s) after SLURM found they cannot be run (e.g. due to network errors) and set their status to something like 'launch failed, requeued held'. If the job ID range is sparse, slurm prints some errors, but does resume all existing jobs.\\
This can also be executed by users for their own jobs.
$ scontrol resume 100
$ scontrol resume [1000,2000]
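To see which of your jobs are currently held and why before resuming them, something like this should work (a sketch; the ''-o'' format just selects the job ID, state and reason columns):
$ squeue -u $USER -o "%.10i %.12T %r"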
==== SLURM Troubleshooting ====
== "Undraining" nodes ==
If you expect that there should be enough resources but SLURM job submission fails, some nodes may be in the "drain" state. You can check this by running ''sinfo''. You could see something like this, where nodes drg06 and drg08 are in drain state:
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
workers*     up   infinite      2  drain drg[06,08]
workers*     up   infinite      1    mix drg01
workers*     up   infinite     21   idle dragproc,drg[02-05,07,09-23]
head         up   infinite      1   idle dragnet
To "undrain" e.g. drg08, you can do:
$ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
$ scontrol update NodeName=drg08 State=RESUME
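Afterwards ''sinfo'' should show the node back in the ''idle'' (or ''mix''/''alloc'') state instead of ''drain'', e.g.:
$ sinfo -n drg08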